EDA Project

The EDA project in this course has four main parts to it:
1. Project Proposal
2. Phase 1
3. Phase 2
4. Report

This notebook will be used for Project Proposal, Phase 1, and Phase 2. You will have specific questions to answer within this notebook for Project Proposal and Phase 1. You will also continue using this notebook for Phase 2. However, guidance and expectations can be found on Canvas for that assignment. The report is completed outside of this notebook (delivered as a PDF). Detailed instructions for that assignment are provided in Canvas.
Read this before proceeding:
1. Review the list of data sets and sources of data to avoid before choosing your data. This list is provided in the instructions for the Project Proposal assignment in Canvas.

2. It is expected that when you are asked questions requiring typed explanations you are to use a markdown cell to type your answers neatly. Do not provide typed answers to questions as extra comments within your code. Only provide comments within your code as you normally would, i.e. as needed to explain or remind yourself what each part of the code is doing.

Project Proposal

The intent of this assignment is for you to share your chosen data file(s) with your instructor and provide general information on your goals for the EDA project.
Step 1 (2 pts): Give a brief description of the source(s) of your data and include a direct link to your data.

The data I am using can be found at https://www.basketball-reference.com/leagues/NBA_2020_totals.html

The data contains individual player statistics for the NBA. It has 30 attributes that describe each player's overall performance in the regular NBA seasons from 2015 to 2020.

Step 2 (2 pts): Briefly explain why you chose this data.

I have been watching NBA basketball since childhood and am a huge fan of the Chicago Bulls, which led to my interest in the overall NBA player statistics data. My initial intuition is that in the past few years NBA players have been scoring more of their points from 3-point shooting, and that players who do so are more likely to be on teams that make the NBA playoffs and win the Larry O'Brien Championship Trophy.

Step 3 (1 pt): Provide a brief overview of your goals for this project.

The goal of this project is to confirm whether my hunch is correct: that shooting 3-pointers is a better strategy than shooting 2-pointers. Based on my initial intuition, I suspect that NBA players who score more in the 3-point offensive categories are more likely to lead in total points and to be on teams that make the NBA playoffs. To answer this question, I need to analyze the data from the 2015-2020 NBA seasons and look further into other attributes of the player statistics, such as Position, Team, Games, Minutes Played, and Field Goal Percentages. Because the NBA player statistics data does not contain playoff information, I will create reference data listing the teams that reached the playoffs each year and merge it with the player dataset.

Step 4 (1 pt): Read the data into this notebook.
In [1]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

nba_stats_season1 = pd.read_csv('nba_stats_2015_2016.csv')
Step 5 (1 pt): Inspect the data using the info( ), head( ), and tail( ) methods.
In [2]:
# TODO: Use the info() method to inspect the variable (column) names, the number of non-null values,
#       and the data types for each variable.

# TODO: Use the head() method to inspect the first five (or more) rows of the data

# TODO: Use the tail() method to inspect the last five (or more) rows of the data
nba_stats_season1.info()
nba_stats_season1.head()
nba_stats_season1.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 578 entries, 0 to 577
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      578 non-null    int64  
 1   Player  578 non-null    object 
 2   Pos     578 non-null    object 
 3   Age     578 non-null    int64  
 4   Tm      578 non-null    object 
 5   G       578 non-null    int64  
 6   GS      578 non-null    int64  
 7   MP      578 non-null    int64  
 8   FG      578 non-null    int64  
 9   FGA     578 non-null    int64  
 10  FG%     575 non-null    float64
 11  3P      578 non-null    int64  
 12  3PA     578 non-null    int64  
 13  3P%     522 non-null    float64
 14  2P      578 non-null    int64  
 15  2PA     578 non-null    int64  
 16  2P%     570 non-null    float64
 17  eFG%    575 non-null    float64
 18  FT      578 non-null    int64  
 19  FTA     578 non-null    int64  
 20  FT%     554 non-null    float64
 21  ORB     578 non-null    int64  
 22  DRB     578 non-null    int64  
 23  TRB     578 non-null    int64  
 24  AST     578 non-null    int64  
 25  STL     578 non-null    int64  
 26  BLK     578 non-null    int64  
 27  TOV     578 non-null    int64  
 28  PF      578 non-null    int64  
 29  PTS     578 non-null    int64  
dtypes: float64(5), int64(22), object(3)
memory usage: 135.6+ KB
Out[2]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Quincy Acy PF 25 SAC 59 29 876 119 214 ... 0.735 65 123 188 27 29 24 27 103 307
1 2 Jordan Adams SG 21 MEM 2 0 15 2 6 ... 0.600 0 2 2 3 3 0 2 2 7
2 3 Steven Adams C 22 OKC 80 80 2014 261 426 ... 0.582 219 314 533 62 42 89 84 223 636
3 4 Arron Afflalo SG 30 NYK 71 57 2371 354 799 ... 0.840 23 243 266 144 25 10 82 142 909
4 5 Alexis Ajinça C 27 NOP 59 17 861 150 315 ... 0.839 75 194 269 31 19 36 54 134 352

5 rows × 30 columns

Out[2]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
573 472 Joe Young PG 23 IND 41 0 384 62 169 ... 0.800 6 44 50 65 15 0 33 30 154
574 473 Nick Young SG 30 LAL 54 2 1033 126 372 ... 0.829 14 83 97 34 23 7 30 50 392
575 474 Thaddeus Young PF 27 BRK 73 73 2407 495 963 ... 0.644 176 484 660 136 112 37 136 182 1102
576 475 Cody Zeller C 23 CHO 73 60 1774 231 437 ... 0.754 138 317 455 71 57 63 68 204 638
577 476 Tyler Zeller C 26 BOS 60 3 710 138 290 ... 0.815 62 116 178 29 10 22 46 97 364

5 rows × 30 columns
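
The `info()` output above reports fewer non-null values for the percentage columns (for example, 522 of 578 for `3P%`) than for the count columns. One plausible explanation, which is an assumption to verify and not something confirmed in this notebook, is that a percentage is left blank when the player had zero attempts. A minimal sketch of that check on a small made-up frame (the rows and the helper name `missing_matches_zero_attempts` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the source layout; not real player data.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P": [10, 0, 0],
    "3PA": [25, 4, 0],
    "3P%": [0.400, 0.000, np.nan],
})

def missing_matches_zero_attempts(df, pct_col, att_col):
    """Return True when pct_col is NaN exactly on the rows where att_col is 0."""
    return bool((df[pct_col].isna() == (df[att_col] == 0)).all())

# Count missing values per column, then test the zero-attempt explanation.
missing_counts = toy.isna().sum()
explained = missing_matches_zero_attempts(toy, "3P%", "3PA")
print(missing_counts["3P%"], explained)
```

If the same check holds on the real season frames, the missing percentages are structural (no attempts) and not data-entry gaps, which matters when deciding whether to drop those rows.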

In [3]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

nba_stats_season2 = pd.read_csv('nba_stats_2016_2017.csv')
In [4]:
nba_stats_season2.info()
nba_stats_season2.head()
nba_stats_season2.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595 entries, 0 to 594
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      595 non-null    int64  
 1   Player  595 non-null    object 
 2   Pos     595 non-null    object 
 3   Age     595 non-null    int64  
 4   Tm      595 non-null    object 
 5   G       595 non-null    int64  
 6   GS      595 non-null    int64  
 7   MP      595 non-null    int64  
 8   FG      595 non-null    int64  
 9   FGA     595 non-null    int64  
 10  FG%     593 non-null    float64
 11  3P      595 non-null    int64  
 12  3PA     595 non-null    int64  
 13  3P%     549 non-null    float64
 14  2P      595 non-null    int64  
 15  2PA     595 non-null    int64  
 16  2P%     590 non-null    float64
 17  eFG%    593 non-null    float64
 18  FT      595 non-null    int64  
 19  FTA     595 non-null    int64  
 20  FT%     571 non-null    float64
 21  ORB     595 non-null    int64  
 22  DRB     595 non-null    int64  
 23  TRB     595 non-null    int64  
 24  AST     595 non-null    int64  
 25  STL     595 non-null    int64  
 26  BLK     595 non-null    int64  
 27  TOV     595 non-null    int64  
 28  PF      595 non-null    int64  
 29  PTS     595 non-null    int64  
dtypes: float64(5), int64(22), object(3)
memory usage: 139.6+ KB
Out[4]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines SG 23 OKC 68 6 1055 134 341 ... 0.898 18 68 86 40 37 8 33 114 406
1 2 Quincy Acy PF 26 TOT 38 1 558 70 170 ... 0.750 20 95 115 18 14 15 21 67 222
2 2 Quincy Acy PF 26 DAL 6 0 48 5 17 ... 0.667 2 6 8 0 0 0 2 9 13
3 2 Quincy Acy PF 26 BRK 32 1 510 65 153 ... 0.754 18 89 107 18 14 15 19 58 209
4 3 Steven Adams C 23 OKC 80 80 2389 374 655 ... 0.611 281 332 613 86 89 78 146 195 905

5 rows × 30 columns

Out[4]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
590 482 Cody Zeller C 24 CHO 62 58 1725 253 443 ... 0.679 135 270 405 99 62 58 65 189 639
591 483 Tyler Zeller C 27 BOS 51 5 525 78 158 ... 0.564 43 81 124 42 7 21 20 61 178
592 484 Stephen Zimmerman C 20 ORL 19 0 108 10 31 ... 0.600 11 24 35 4 2 5 3 17 23
593 485 Paul Zipser SF 22 CHI 44 18 843 88 221 ... 0.775 15 110 125 36 15 16 40 78 240
594 486 Ivica Zubac C 19 LAL 38 11 609 126 238 ... 0.653 41 118 159 30 14 33 30 66 284

5 rows × 30 columns

In [5]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

nba_stats_season3 = pd.read_csv('nba_stats_2017_2018.csv')
In [6]:
nba_stats_season3.info()
nba_stats_season3.head()
nba_stats_season3.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 664 entries, 0 to 663
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      664 non-null    int64  
 1   Player  664 non-null    object 
 2   Pos     664 non-null    object 
 3   Age     664 non-null    int64  
 4   Tm      664 non-null    object 
 5   G       664 non-null    int64  
 6   GS      664 non-null    int64  
 7   MP      664 non-null    int64  
 8   FG      664 non-null    int64  
 9   FGA     664 non-null    int64  
 10  FG%     660 non-null    float64
 11  3P      664 non-null    int64  
 12  3PA     664 non-null    int64  
 13  3P%     599 non-null    float64
 14  2P      664 non-null    int64  
 15  2PA     664 non-null    int64  
 16  2P%     646 non-null    float64
 17  eFG%    660 non-null    float64
 18  FT      664 non-null    int64  
 19  FTA     664 non-null    int64  
 20  FT%     606 non-null    float64
 21  ORB     664 non-null    int64  
 22  DRB     664 non-null    int64  
 23  TRB     664 non-null    int64  
 24  AST     664 non-null    int64  
 25  STL     664 non-null    int64  
 26  BLK     664 non-null    int64  
 27  TOV     664 non-null    int64  
 28  PF      664 non-null    int64  
 29  PTS     664 non-null    int64  
dtypes: float64(5), int64(22), object(3)
memory usage: 155.8+ KB
Out[6]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines SG 24 OKC 75 8 1134 115 291 ... 0.848 26 88 114 28 38 8 25 124 353
1 2 Quincy Acy PF 27 BRK 70 8 1359 130 365 ... 0.817 40 217 257 57 33 29 60 149 411
2 3 Steven Adams C 24 OKC 76 76 2487 448 712 ... 0.559 384 301 685 88 92 78 128 215 1056
3 4 Bam Adebayo C 20 MIA 69 19 1368 174 340 ... 0.721 118 263 381 101 32 41 66 138 477
4 5 Arron Afflalo SG 32 ORL 53 3 682 65 162 ... 0.846 4 62 66 30 4 9 21 56 179

5 rows × 30 columns

Out[6]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
659 537 Tyler Zeller C 28 BRK 42 33 703 125 229 ... 0.667 63 131 194 28 8 21 35 78 300
660 537 Tyler Zeller C 28 MIL 24 1 406 62 105 ... 0.895 47 64 111 19 7 14 12 48 141
661 538 Paul Zipser SF 23 CHI 54 12 824 81 234 ... 0.760 13 118 131 46 20 15 43 86 218
662 539 Ante Žižić C 21 CLE 32 2 214 49 67 ... 0.724 24 36 60 5 2 13 11 30 119
663 540 Ivica Zubac C 20 LAL 43 0 410 61 122 ... 0.765 45 78 123 25 8 15 26 47 161

5 rows × 30 columns

In [7]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

nba_stats_season4 = pd.read_csv('nba_stats_2018_2019.csv')
In [8]:
nba_stats_season4.info()
nba_stats_season4.head()
nba_stats_season4.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 708 entries, 0 to 707
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      708 non-null    int64  
 1   Player  708 non-null    object 
 2   Pos     708 non-null    object 
 3   Age     708 non-null    int64  
 4   Tm      708 non-null    object 
 5   G       708 non-null    int64  
 6   GS      708 non-null    int64  
 7   MP      708 non-null    int64  
 8   FG      708 non-null    int64  
 9   FGA     708 non-null    int64  
 10  FG%     702 non-null    float64
 11  3P      708 non-null    int64  
 12  3PA     708 non-null    int64  
 13  3P%     661 non-null    float64
 14  2P      708 non-null    int64  
 15  2PA     708 non-null    int64  
 16  2P%     693 non-null    float64
 17  eFG%    702 non-null    float64
 18  FT      708 non-null    int64  
 19  FTA     708 non-null    int64  
 20  FT%     665 non-null    float64
 21  ORB     708 non-null    int64  
 22  DRB     708 non-null    int64  
 23  TRB     708 non-null    int64  
 24  AST     708 non-null    int64  
 25  STL     708 non-null    int64  
 26  BLK     708 non-null    int64  
 27  TOV     708 non-null    int64  
 28  PF      708 non-null    int64  
 29  PTS     708 non-null    int64  
dtypes: float64(5), int64(22), object(3)
memory usage: 166.1+ KB
Out[8]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines SG 25 OKC 31 2 588 56 157 ... 0.923 5 43 48 20 17 6 14 53 165
1 2 Quincy Acy PF 28 PHO 10 0 123 4 18 ... 0.700 3 22 25 8 1 4 4 24 17
2 3 Jaylen Adams PG 22 ATL 34 1 428 38 110 ... 0.778 11 49 60 65 14 5 28 45 108
3 4 Steven Adams C 25 OKC 80 80 2669 481 809 ... 0.500 391 369 760 124 117 76 135 204 1108
4 5 Bam Adebayo C 21 MIA 82 28 1913 280 486 ... 0.735 165 432 597 184 71 65 121 203 729

5 rows × 30 columns

Out[8]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
703 528 Tyler Zeller C 29 MEM 4 1 82 16 28 ... 0.778 9 9 18 3 1 3 4 16 46
704 529 Ante Žižić C 22 CLE 59 25 1082 183 331 ... 0.705 108 212 320 53 13 22 61 113 459
705 530 Ivica Zubac C 21 TOT 59 37 1040 212 379 ... 0.802 115 247 362 63 14 51 70 137 525
706 530 Ivica Zubac C 21 LAL 33 12 516 112 193 ... 0.864 54 108 162 25 4 27 33 73 281
707 530 Ivica Zubac C 21 LAC 26 25 524 100 186 ... 0.733 61 139 200 38 10 24 37 64 244

5 rows × 30 columns

In [9]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

nba_stats_season5 = pd.read_csv('nba_stats_2019_2020.csv')
In [10]:
nba_stats_season5.info()
nba_stats_season5.head()
nba_stats_season5.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651 entries, 0 to 650
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      651 non-null    int64  
 1   Player  651 non-null    object 
 2   Pos     651 non-null    object 
 3   Age     651 non-null    int64  
 4   Tm      651 non-null    object 
 5   G       651 non-null    int64  
 6   GS      651 non-null    int64  
 7   MP      651 non-null    int64  
 8   FG      651 non-null    int64  
 9   FGA     651 non-null    int64  
 10  FG%     649 non-null    float64
 11  3P      651 non-null    int64  
 12  3PA     651 non-null    int64  
 13  3P%     616 non-null    float64
 14  2P      651 non-null    int64  
 15  2PA     651 non-null    int64  
 16  2P%     645 non-null    float64
 17  eFG%    649 non-null    float64
 18  FT      651 non-null    int64  
 19  FTA     651 non-null    int64  
 20  FT%     618 non-null    float64
 21  ORB     651 non-null    int64  
 22  DRB     651 non-null    int64  
 23  TRB     651 non-null    int64  
 24  AST     651 non-null    int64  
 25  STL     651 non-null    int64  
 26  BLK     651 non-null    int64  
 27  TOV     651 non-null    int64  
 28  PF      651 non-null    int64  
 29  PTS     651 non-null    int64  
dtypes: float64(5), int64(22), object(3)
memory usage: 152.7+ KB
Out[10]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Steven Adams C 26 OKC 63 63 1680 283 478 ... 0.582 207 376 583 146 51 67 94 122 684
1 2 Bam Adebayo PF 22 MIA 72 72 2417 440 790 ... 0.691 176 559 735 368 82 93 204 182 1146
2 3 LaMarcus Aldridge C 34 SAS 53 53 1754 391 793 ... 0.827 103 289 392 129 36 87 74 128 1001
3 4 Kyle Alexander C 23 MIA 2 0 13 1 2 ... NaN 2 1 3 0 0 0 1 1 2
4 5 Nickeil Alexander-Walker SG 21 NOP 47 1 591 98 266 ... 0.676 9 75 84 89 17 8 54 57 267

5 rows × 30 columns

Out[10]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
646 525 Trae Young PG 21 ATL 60 60 2120 546 1249 ... 0.860 32 223 255 560 65 8 289 104 1778
647 526 Cody Zeller C 27 CHO 58 39 1341 251 479 ... 0.682 160 251 411 88 40 25 75 140 642
648 527 Tyler Zeller C 30 SAS 2 0 4 1 4 ... NaN 3 1 4 0 0 0 0 0 2
649 528 Ante Žižić C 23 CLE 22 0 221 41 72 ... 0.737 18 48 66 6 7 5 10 27 96
650 529 Ivica Zubac C 22 LAC 72 70 1326 236 385 ... 0.747 197 346 543 82 16 66 61 168 596

5 rows × 30 columns

In [11]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

nba_playoffs_data = pd.read_csv('nba_playoffs_data.csv')
In [12]:
nba_playoffs_data.info()
nba_playoffs_data.head()
nba_playoffs_data.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Year     80 non-null     object
 1   Tm       80 non-null     object
 2   Playoff  80 non-null     object
dtypes: object(3)
memory usage: 2.0+ KB
Out[12]:
Year Tm Playoff
0 2019-2020 MIL Y
1 2019-2020 ORL Y
2 2019-2020 IND Y
3 2019-2020 MIA Y
4 2019-2020 BOS Y
Out[12]:
Year Tm Playoff
75 2015-2016 POR Y
76 2015-2016 OKC Y
77 2015-2016 DAL Y
78 2015-2016 SAS Y
79 2015-2016 MEM Y
STOP HERE for your Project Proposal assignment. Submit your (1) original data file(s) along with (2) the completed notebook up to this point, and (3) the html file for grading and approval.
Instructor Feedback and Approval (3 pts): Your instructor will provide feedback in either the cell below this or via Canvas. You can expect one of the following point values for this portion.
3 pts - if your project goals and data set are both approved.
2 pts - if your data set is approved but changes to your project goals (Step 3) are needed.
1 pt - if your project goals are approved but your data set is not approved.
0 pts - if neither your data set nor your project goals are approved.

As needed, follow your instructor's feedback and guidance to get on track for the remaining portions of the EDA project.

EDA Phase 1

The overall goal of this assignment is to take all necessary steps to inspect the quality of your data and prepare the data according to your needs. For information and resources on the process of Exploratory Data Analysis (EDA), you should explore the EDA Project Resources Module in Canvas. Once you’ve read through the information provided in that module and have a comfortable understanding of EDA using Python, complete steps 6 through 10 listed below to satisfy the requirements for your EDA Phase 1 assignment. **Remember to convert code cells provided to markdown cells for any typed responses to questions.**
Step 6 (2 pts): Begin by elaborating in more detail from the previous assignment on why you chose this data?
1. Explain what you hope to learn from this data.
2. Do you have a hunch about what this data will reveal? (The answer to this question will be used in the Introduction section of your EDA report.)

I hope to learn whether there is a positive or negative correlation between how many points NBA players average from 3-pointers and 2-pointers and their teams' chances of making the NBA playoffs. I also want to explore how player position relates to the shooting categories.

I have a hunch this data will reveal that NBA players who average more 3-pointers per game, and who have good offensive stats in categories like '3Pointers', 'FieldGoalsPerGame', and 'TotalPointsPerGame', are more likely to lead in 'TotalPoints' and to be on a 'Team' that makes the 'Playoff' in a season than players who score mostly 2-pointers. I also expect that Shooting Guards (SG) and Point Guards (PG) are more likely to make more 3-pointers per game and to lead in total points per season, while Centers (C), Small Forwards (SF), and Power Forwards (PF) are likely to make more 2-pointers per game.

Step 7 (2 pts): Discuss the population and the sample:
1. What is the population being represented by the data you’ve chosen?
2. What is the total sample size?

The population represented is all NBA players in the 2015-2020 seasons, described by their overall statistics, together with reference data identifying which players were on teams that made the playoffs in those seasons.

The sample contains 3196 rows and 37 columns of NBA player statistics and playoff reference data from 2015-2020.

Step 8 (2 pts): Describe how the data was collected. For example, is this a random sample? Are sampling weights used with the data?

The data was collected from https://www.basketball-reference.com/, which makes a variety of NBA stats publicly available. Our data covers overall NBA player statistics for the 2015-2020 seasons and was downloaded from https://www.basketball-reference.com/leagues/NBA_2020_totals.html

This is not a random sample, and sampling weights are not used with the data.

Step 9 (4 pts): In the Project Proposal assignment you used the info( ) method to inspect the variables, their data types, and the number of non-null values. Using that information as a guide, provide definitions of each of your variables and their corresponding data types, i.e. a data dictionary. Also indicate which variables will be used for your purposes.
| # | Variable | Definition | Data Type | Will Be Used |
|---|---|---|---|---|
| 1 | Rk | Row rank of the player within each season's table (the head() output shows it follows alphabetical order by name, so it is not a performance ranking) | Integer | |
| 2 | Player | Name of the NBA player in each season | String | X |
| 3 | Pos | Position the NBA player plays | String | X |
| 4 | Age | Age of the NBA player | Integer | X |
| 5 | Tm | Team of the NBA player | String | X |
| 6 | G | Number of games played by the NBA player in that season | Integer | X |
| 7 | GS | Number of games started by the NBA player in that season | Integer | |
| 8 | MP | Number of minutes played by the NBA player in that season | Integer | X |
| 9 | FG | Number of field goals the NBA player has made, including both 2-pointers and 3-pointers | Integer | X |
| 10 | FGA | Number of field goals the NBA player has attempted, including both 2-pointers and 3-pointers | Integer | X |
| 11 | FG% | Percentage of field goal attempts the NBA player makes in that season | Float | X |
| 12 | 3P | Number of 3-point field goals the NBA player has made in that season | Integer | X |
| 13 | 3PA | Number of 3-point field goals the NBA player has attempted in that season | Integer | X |
| 14 | 3P% | Percentage of 3-point field goal attempts the NBA player makes in that season | Float | X |
| 15 | 2P | Number of 2-point field goals the NBA player has made in that season | Integer | X |
| 16 | 2PA | Number of 2-point field goals the NBA player has attempted in that season | Integer | X |
| 17 | 2P% | Percentage of 2-point field goal attempts the NBA player makes in that season | Float | X |
| 18 | eFG% | Effective field goal percentage, which adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal | Float | X |
| 19 | FT | Number of free throws the NBA player has made in that season | Integer | |
| 20 | FTA | Number of free throws the NBA player has attempted in that season | Integer | |
| 21 | FT% | Percentage of free throw attempts the NBA player makes in that season | Float | |
| 22 | ORB | Number of offensive rebounds the NBA player has collected while playing on offense in that season | Integer | |
| 23 | DRB | Number of defensive rebounds the NBA player has collected while playing on defense in that season | Integer | |
| 24 | TRB | Number of total rebounds the NBA player has collected in that season | Integer | |
| 25 | AST | Number of assists; an assist is a pass to another player that leads directly to a made basket | Integer | |
| 26 | STL | Number of times the NBA player, on defense, takes the ball from a player on offense in that season | Integer | |
| 27 | BLK | Number of blocks; a block occurs when an offensive player attempts a shot and the defensive player tips the ball, preventing the score | Integer | |
| 28 | TOV | Number of turnovers; a turnover occurs when the player on offense loses the ball to the defense. Collected for each NBA player in that season | Integer | |
| 29 | PF | Number of personal fouls the NBA player has committed in that season | Integer | |
| 30 | PTS | Number of points scored by the NBA player in that season | Integer | X |
| 31 | Year | Reference column that tracks which season each row of player statistics belongs to | String | X |
| 32 | Playoff | Reference column that tracks whether the player's team made the playoffs in the corresponding season | String | X |
| 33 | MinutesPerGame | Calculated column: 'MinutesPlayed' divided by 'GamesPlayed'. The original columns hold season totals, so per-game values allow better analysis and interpretation | Float | X |
| 34 | FieldGoalsPerGame | Calculated column: 'FieldGoals' divided by 'GamesPlayed'. The original columns hold season totals, so per-game values allow better analysis and interpretation | Float | X |
| 35 | 3PointerPerGame | Calculated column: '3Pointers' divided by 'GamesPlayed'. The original columns hold season totals, so per-game values allow better analysis and interpretation | Float | X |
| 36 | 2PointerPerGame | Calculated column: '2Pointers' divided by 'GamesPlayed'. The original columns hold season totals, so per-game values allow better analysis and interpretation | Float | X |
| 37 | TotalPointsPerGame | Calculated column: 'TotalPoints' divided by 'GamesPlayed'. The original columns hold season totals, so per-game values allow better analysis and interpretation | Float | X |
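
The five per-game variables in the data dictionary are each a season total divided by games played. A minimal sketch of that calculation follows, written against the source column names (`MP`, `FG`, `3P`, `2P`, `PTS`, `G`). The mapping from those names to the descriptive names quoted in the dictionary, the helper `add_per_game`, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical season totals in the source's column names (not real data).
totals = pd.DataFrame({
    "Player": ["A", "B"],
    "G": [10, 4],
    "MP": [300, 60],
    "FG": [50, 6],
    "3P": [20, 0],
    "2P": [30, 6],
    "PTS": [140, 14],
})

# Assumed mapping from season-total columns to the derived per-game names.
PER_GAME = {
    "MP": "MinutesPerGame",
    "FG": "FieldGoalsPerGame",
    "3P": "3PointerPerGame",
    "2P": "2PointerPerGame",
    "PTS": "TotalPointsPerGame",
}

def add_per_game(df):
    """Return a copy with each season total divided by games played (G)."""
    out = df.copy()
    for total_col, per_game_col in PER_GAME.items():
        out[per_game_col] = out[total_col] / out["G"]
    return out

per_game = add_per_game(totals)
print(per_game[["Player", "MinutesPerGame", "3PointerPerGame"]])
```

Working on a copy keeps the season totals intact, so the per-game columns can be recomputed if rows are later removed during cleaning.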
Step 10 (10 pts): For full credit in this problem you'll want to take all necessary steps to report on the quality of the data and clean the data accordingly. Some things to consider while doing this are listed below. Depending on your data and goals, there may be additional steps needed than those listed here.
1. Are there rows with missing or inconsistent values? If so, eliminate those rows from your data where appropriate.
2. Are there any outliers or duplicate rows? If so, eliminate those rows from your data where appropriate. At each stage of cleaning the data, state how many rows were eliminated.
3. Are you using all columns (variables) in the data? If not, are you eliminating those columns?
4. Consider some type of visual display such as a boxplot to determine any outliers. Do any outliers need removed? If so, how many were removed? At each stage of cleaning the data, state how many rows were eliminated.

It is good practice to get the shape of the data before and after each step in cleaning the data and add typed explanations (in separate markdown cells) of the steps taken to clean the data.
Include the rest of your work below and insert cells where needed.

The first step is to add a Year column to each season's NBA statistics dataframe. The second step is to concatenate all of the season dataframes into one large dataset, the nba_players_stats dataframe. The third step is to merge nba_players_stats with the nba_playoffs_data dataframe using an outer join on the Year and Team (Tm) columns, so that after the merge we can analyze which teams made the playoffs in which year. The joined table is assigned to overall_nba_playoffs_stats.
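
The third step, the outer join on Year and Tm, can be tried on toy data before running it on the full tables. The sketch below uses small hypothetical stand-ins for the two dataframes. Filling unmatched rows with "N" is an assumption about how non-playoff teams should be labeled, since the reference file shown earlier only lists teams with Playoff equal to Y.

```python
import pandas as pd

# Hypothetical stand-ins for nba_players_stats and nba_playoffs_data.
stats = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Tm": ["MIL", "NYK", "TOT"],
    "Year": ["2019-2020", "2019-2020", "2019-2020"],
    "PTS": [900, 400, 250],
})
playoffs = pd.DataFrame({
    "Year": ["2019-2020"],
    "Tm": ["MIL"],
    "Playoff": ["Y"],
})

# Outer join on the shared keys. indicator=True adds a _merge column that
# records whether each row matched, and validate raises an error if the
# reference table holds duplicate (Year, Tm) pairs.
merged = stats.merge(
    playoffs, how="outer", on=["Year", "Tm"],
    indicator=True, validate="many_to_one",
)

# Assumption: teams absent from the reference table did not make the playoffs.
merged["Playoff"] = merged["Playoff"].fillna("N")
print(merged[["Player", "Tm", "Playoff", "_merge"]])
```

Rows whose Tm is TOT (the combined line for a player who changed teams, as in the Quincy Acy rows of the 2016-2017 output) never match a playoff team, so the `_merge` column is a convenient way to find them and decide how to treat them.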

In [13]:
nba_stats_season1['Year'] = "2015-2016" #Adding Year column to this dataframe since data represents 2015-2016 stats
nba_stats_season1.head(5)

nba_stats_season2['Year'] = "2016-2017" #Adding Year column to this dataframe since data represents 2016-2017 stats
nba_stats_season2.head(5)

nba_stats_season3['Year'] = "2017-2018" #Adding Year column to this dataframe since data represents 2017-2018 stats
nba_stats_season3.head(5)

nba_stats_season4['Year'] = "2018-2019" #Adding Year column to this dataframe since data represents 2018-2019 stats
nba_stats_season4.head(5)

nba_stats_season5['Year'] = "2019-2020" #Adding Year column to this dataframe since data represents 2019-2020 stats
nba_stats_season5.head(5)
Out[13]:
Rk Player Pos Age Tm G GS MP FG FGA ... ORB DRB TRB AST STL BLK TOV PF PTS Year
0 1 Quincy Acy PF 25 SAC 59 29 876 119 214 ... 65 123 188 27 29 24 27 103 307 2015-2016
1 2 Jordan Adams SG 21 MEM 2 0 15 2 6 ... 0 2 2 3 3 0 2 2 7 2015-2016
2 3 Steven Adams C 22 OKC 80 80 2014 261 426 ... 219 314 533 62 42 89 84 223 636 2015-2016
3 4 Arron Afflalo SG 30 NYK 71 57 2371 354 799 ... 23 243 266 144 25 10 82 142 909 2015-2016
4 5 Alexis Ajinça C 27 NOP 59 17 861 150 315 ... 75 194 269 31 19 36 54 134 352 2015-2016

5 rows × 31 columns

Out[13]:
Rk Player Pos Age Tm G GS MP FG FGA ... ORB DRB TRB AST STL BLK TOV PF PTS Year
0 1 Álex Abrines SG 23 OKC 68 6 1055 134 341 ... 18 68 86 40 37 8 33 114 406 2016-2017
1 2 Quincy Acy PF 26 TOT 38 1 558 70 170 ... 20 95 115 18 14 15 21 67 222 2016-2017
2 2 Quincy Acy PF 26 DAL 6 0 48 5 17 ... 2 6 8 0 0 0 2 9 13 2016-2017
3 2 Quincy Acy PF 26 BRK 32 1 510 65 153 ... 18 89 107 18 14 15 19 58 209 2016-2017
4 3 Steven Adams C 23 OKC 80 80 2389 374 655 ... 281 332 613 86 89 78 146 195 905 2016-2017

5 rows × 31 columns

Out[13]:
Rk Player Pos Age Tm G GS MP FG FGA ... ORB DRB TRB AST STL BLK TOV PF PTS Year
0 1 Álex Abrines SG 24 OKC 75 8 1134 115 291 ... 26 88 114 28 38 8 25 124 353 2017-2018
1 2 Quincy Acy PF 27 BRK 70 8 1359 130 365 ... 40 217 257 57 33 29 60 149 411 2017-2018
2 3 Steven Adams C 24 OKC 76 76 2487 448 712 ... 384 301 685 88 92 78 128 215 1056 2017-2018
3 4 Bam Adebayo C 20 MIA 69 19 1368 174 340 ... 118 263 381 101 32 41 66 138 477 2017-2018
4 5 Arron Afflalo SG 32 ORL 53 3 682 65 162 ... 4 62 66 30 4 9 21 56 179 2017-2018

5 rows × 31 columns

Out[13]:
Rk Player Pos Age Tm G GS MP FG FGA ... ORB DRB TRB AST STL BLK TOV PF PTS Year
0 1 Álex Abrines SG 25 OKC 31 2 588 56 157 ... 5 43 48 20 17 6 14 53 165 2018-2019
1 2 Quincy Acy PF 28 PHO 10 0 123 4 18 ... 3 22 25 8 1 4 4 24 17 2018-2019
2 3 Jaylen Adams PG 22 ATL 34 1 428 38 110 ... 11 49 60 65 14 5 28 45 108 2018-2019
3 4 Steven Adams C 25 OKC 80 80 2669 481 809 ... 391 369 760 124 117 76 135 204 1108 2018-2019
4 5 Bam Adebayo C 21 MIA 82 28 1913 280 486 ... 165 432 597 184 71 65 121 203 729 2018-2019

5 rows × 31 columns

Out[13]:
Rk Player Pos Age Tm G GS MP FG FGA ... ORB DRB TRB AST STL BLK TOV PF PTS Year
0 1 Steven Adams C 26 OKC 63 63 1680 283 478 ... 207 376 583 146 51 67 94 122 684 2019-2020
1 2 Bam Adebayo PF 22 MIA 72 72 2417 440 790 ... 176 559 735 368 82 93 204 182 1146 2019-2020
2 3 LaMarcus Aldridge C 34 SAS 53 53 1754 391 793 ... 103 289 392 129 36 87 74 128 1001 2019-2020
3 4 Kyle Alexander C 23 MIA 2 0 13 1 2 ... 2 1 3 0 0 0 1 1 2 2019-2020
4 5 Nickeil Alexander-Walker SG 21 NOP 47 1 591 98 266 ... 9 75 84 89 17 8 54 57 267 2019-2020

5 rows × 31 columns
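
The five near-identical assignments in the cell above could also be written as one loop over a mapping from season label to dataframe. This is only a sketch: the `seasons` dictionary and its one-row frames are hypothetical placeholders for `nba_stats_season1` through `nba_stats_season5`.

```python
import pandas as pd

# Hypothetical one-row frames standing in for the season DataFrames.
seasons = {
    "2015-2016": pd.DataFrame({"Player": ["A"], "PTS": [307]}),
    "2016-2017": pd.DataFrame({"Player": ["B"], "PTS": [406]}),
}

# assign() returns a new frame, so the originals are left unchanged.
tagged = {label: df.assign(Year=label) for label, df in seasons.items()}

for label, df in tagged.items():
    print(label, df.columns.tolist())
```

Keeping the label next to the frame it belongs to removes the risk of pasting the wrong season string into one of the five lines.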

We are going to concatenate all of the NBA season dataframes, each now with a Year column, into one large dataset assigned to the variable nba_players_stats below.

In [14]:
nba_players_stats = pd.concat([nba_stats_season1, nba_stats_season2, nba_stats_season3, nba_stats_season4, nba_stats_season5])
nba_players_stats.shape
nba_players_stats.info()
nba_players_stats.head()
Out[14]:
(3196, 31)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3196 entries, 0 to 650
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      3196 non-null   int64  
 1   Player  3196 non-null   object 
 2   Pos     3196 non-null   object 
 3   Age     3196 non-null   int64  
 4   Tm      3196 non-null   object 
 5   G       3196 non-null   int64  
 6   GS      3196 non-null   int64  
 7   MP      3196 non-null   int64  
 8   FG      3196 non-null   int64  
 9   FGA     3196 non-null   int64  
 10  FG%     3179 non-null   float64
 11  3P      3196 non-null   int64  
 12  3PA     3196 non-null   int64  
 13  3P%     2947 non-null   float64
 14  2P      3196 non-null   int64  
 15  2PA     3196 non-null   int64  
 16  2P%     3144 non-null   float64
 17  eFG%    3179 non-null   float64
 18  FT      3196 non-null   int64  
 19  FTA     3196 non-null   int64  
 20  FT%     3014 non-null   float64
 21  ORB     3196 non-null   int64  
 22  DRB     3196 non-null   int64  
 23  TRB     3196 non-null   int64  
 24  AST     3196 non-null   int64  
 25  STL     3196 non-null   int64  
 26  BLK     3196 non-null   int64  
 27  TOV     3196 non-null   int64  
 28  PF      3196 non-null   int64  
 29  PTS     3196 non-null   int64  
 30  Year    3196 non-null   object 
dtypes: float64(5), int64(22), object(4)
memory usage: 799.0+ KB
Out[14]:
Rk Player Pos Age Tm G GS MP FG FGA ... ORB DRB TRB AST STL BLK TOV PF PTS Year
0 1 Quincy Acy PF 25 SAC 59 29 876 119 214 ... 65 123 188 27 29 24 27 103 307 2015-2016
1 2 Jordan Adams SG 21 MEM 2 0 15 2 6 ... 0 2 2 3 3 0 2 2 7 2015-2016
2 3 Steven Adams C 22 OKC 80 80 2014 261 426 ... 219 314 533 62 42 89 84 223 636 2015-2016
3 4 Arron Afflalo SG 30 NYK 71 57 2371 354 799 ... 23 243 266 144 25 10 82 142 909 2015-2016
4 5 Alexis Ajinça C 27 NOP 59 17 861 150 315 ... 75 194 269 31 19 36 54 134 352 2015-2016

5 rows × 31 columns
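The info() output above reports 3196 entries but index labels that only run from 0 to 650, which suggests each season kept its own row labels and the labels repeat across seasons. A minimal sketch on made-up frames (the names season_a and season_b and their values are hypothetical, not the project data) shows how ignore_index=True would give unique labels:

```python
import pandas as pd

# Hypothetical stand-ins for two per-season frames; values are illustrative only.
season_a = pd.DataFrame({"Player": ["A", "B"], "PTS": [10, 20], "Year": "2015-2016"})
season_b = pd.DataFrame({"Player": ["A", "C"], "PTS": [15, 5], "Year": "2016-2017"})

# Default concat keeps each frame's own labels, so 0 and 1 appear twice.
stacked = pd.concat([season_a, season_b])

# ignore_index=True renumbers the rows 0..n-1.
stacked_unique = pd.concat([season_a, season_b], ignore_index=True)

print(stacked.index.tolist())
print(stacked_unique.index.tolist())
```

Repeated labels matter mainly when rows are later selected or dropped by label. Here the merge in the next cell produces a fresh index, so the repeated labels do not carry forward.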

Step 1: We merge nba_players_stats, the concatenated dataframe, with the nba_playoffs_data reference dataframe. The reference dataframe has Year, Tm (team), and Playoff columns, and Playoff is 'Y' in every row because it lists only the teams that made the playoffs.

Step 2: We use an outer join on the Year and Tm columns. An outer join keeps every row from the left dataframe and every row from the right dataframe, matching them on Year and Tm. Players on playoff teams get Playoff = 'Y', and players on teams that did not make the playoffs get NaN in the Playoff column.

In [15]:
overall_nba_playoffs_stats = pd.merge(nba_players_stats, nba_playoffs_data, how="outer", on=["Year", "Tm"])
overall_nba_playoffs_stats.shape
overall_nba_playoffs_stats.info()
overall_nba_playoffs_stats
Out[15]:
(3196, 32)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3196 entries, 0 to 3195
Data columns (total 32 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Rk       3196 non-null   int64  
 1   Player   3196 non-null   object 
 2   Pos      3196 non-null   object 
 3   Age      3196 non-null   int64  
 4   Tm       3196 non-null   object 
 5   G        3196 non-null   int64  
 6   GS       3196 non-null   int64  
 7   MP       3196 non-null   int64  
 8   FG       3196 non-null   int64  
 9   FGA      3196 non-null   int64  
 10  FG%      3179 non-null   float64
 11  3P       3196 non-null   int64  
 12  3PA      3196 non-null   int64  
 13  3P%      2947 non-null   float64
 14  2P       3196 non-null   int64  
 15  2PA      3196 non-null   int64  
 16  2P%      3144 non-null   float64
 17  eFG%     3179 non-null   float64
 18  FT       3196 non-null   int64  
 19  FTA      3196 non-null   int64  
 20  FT%      3014 non-null   float64
 21  ORB      3196 non-null   int64  
 22  DRB      3196 non-null   int64  
 23  TRB      3196 non-null   int64  
 24  AST      3196 non-null   int64  
 25  STL      3196 non-null   int64  
 26  BLK      3196 non-null   int64  
 27  TOV      3196 non-null   int64  
 28  PF       3196 non-null   int64  
 29  PTS      3196 non-null   int64  
 30  Year     3196 non-null   object 
 31  Playoff  1491 non-null   object 
dtypes: float64(5), int64(22), object(5)
memory usage: 824.0+ KB
Out[15]:
Rk Player Pos Age Tm G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS Year Playoff
0 1 Quincy Acy PF 25 SAC 59 29 876 119 214 ... 123 188 27 29 24 27 103 307 2015-2016 NaN
1 15 James Anderson SG 26 SAC 51 15 721 67 178 ... 73 86 41 21 14 42 54 179 2015-2016 NaN
2 44 Marco Belinelli SG 29 SAC 68 7 1672 245 635 ... 107 117 127 37 2 80 91 696 2015-2016 NaN
3 71 Caron Butler SF 35 SAC 17 1 176 25 59 ... 17 22 10 9 1 3 19 63 2015-2016 NaN
4 82 Omri Casspi PF 27 SAC 69 21 1880 299 622 ... 352 410 95 56 17 94 154 813 2015-2016 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3191 495 Kemba Walker PG 29 BOS 56 56 1742 378 889 ... 181 217 268 48 28 117 91 1145 2019-2020 Y
3192 499 Brad Wanamaker PG 30 BOS 71 1 1369 162 362 ... 122 144 179 61 14 76 133 487 2019-2020 Y
3193 503 Tremont Waters PG 22 BOS 11 1 119 14 49 ... 12 12 16 10 2 15 13 40 2019-2020 Y
3194 511 Grant Williams PF 21 BOS 69 5 1043 87 211 ... 119 178 68 30 36 50 163 237 2019-2020 Y
3195 516 Robert Williams C 22 BOS 29 1 388 64 88 ... 88 128 27 22 35 21 51 150 2019-2020 Y

3196 rows × 32 columns
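The row count stays at 3196 after the join, which is consistent with every reference row matching at least one player row. A minimal sketch of how a merge like this could be double-checked, using toy frames with hypothetical names (players, playoffs) and a left join instead of the outer join used above:

```python
import pandas as pd

# Toy data; the names and values are assumptions for illustration only.
players = pd.DataFrame({
    "Player": ["P1", "P2", "P3"],
    "Tm": ["BOS", "SAC", "BOS"],
    "Year": ["2019-2020", "2019-2020", "2019-2020"],
})
playoffs = pd.DataFrame({"Year": ["2019-2020"], "Tm": ["BOS"], "Playoff": ["Y"]})

# validate="many_to_one" raises an error if a (Year, Tm) pair repeats in the
# reference frame; indicator=True adds a _merge column showing where each row came from.
merged = pd.merge(
    players, playoffs, how="left", on=["Year", "Tm"],
    validate="many_to_one", indicator=True,
)
print(merged[["Player", "Tm", "Playoff", "_merge"]])
```

A left join keeps exactly one output row per player row, so the player table can never gain rows from the reference table.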

After the outer join, the NaN values in the Playoff column mark the players whose teams did not make the playoffs in that season. We set those values to Playoff = 'N', so every row is labelled 'Y' or 'N' across the five seasons of historical data.

In [16]:
overall_nba_playoffs_stats.loc[overall_nba_playoffs_stats.Playoff != "Y", "Playoff"] = "N"
overall_nba_playoffs_stats
Out[16]:
Rk Player Pos Age Tm G GS MP FG FGA ... DRB TRB AST STL BLK TOV PF PTS Year Playoff
0 1 Quincy Acy PF 25 SAC 59 29 876 119 214 ... 123 188 27 29 24 27 103 307 2015-2016 N
1 15 James Anderson SG 26 SAC 51 15 721 67 178 ... 73 86 41 21 14 42 54 179 2015-2016 N
2 44 Marco Belinelli SG 29 SAC 68 7 1672 245 635 ... 107 117 127 37 2 80 91 696 2015-2016 N
3 71 Caron Butler SF 35 SAC 17 1 176 25 59 ... 17 22 10 9 1 3 19 63 2015-2016 N
4 82 Omri Casspi PF 27 SAC 69 21 1880 299 622 ... 352 410 95 56 17 94 154 813 2015-2016 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3191 495 Kemba Walker PG 29 BOS 56 56 1742 378 889 ... 181 217 268 48 28 117 91 1145 2019-2020 Y
3192 499 Brad Wanamaker PG 30 BOS 71 1 1369 162 362 ... 122 144 179 61 14 76 133 487 2019-2020 Y
3193 503 Tremont Waters PG 22 BOS 11 1 119 14 49 ... 12 12 16 10 2 15 13 40 2019-2020 Y
3194 511 Grant Williams PF 21 BOS 69 5 1043 87 211 ... 119 178 68 30 36 50 163 237 2019-2020 Y
3195 516 Robert Williams C 22 BOS 29 1 388 64 88 ... 88 128 27 22 35 21 51 150 2019-2020 Y

3196 rows × 32 columns

The query below shows that 1491 player-season rows belong to teams that made the playoffs and 1705 player-season rows belong to teams that did not, for the NBA seasons from 2015 to 2020.

In [17]:
overall_nba_playoffs_stats.Playoff.str.contains("Y").sum()
overall_nba_playoffs_stats.Playoff.str.contains("N").sum()
Out[17]:
1491
Out[17]:
1705
In [18]:
overall_nba_playoffs_stats.shape
Out[18]:
(3196, 32)
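The two counts above add up to the full table (1491 + 1705 = 3196). As a small sketch on a made-up Series (not the project data), the same labelling and counting can be done with fillna and value_counts:

```python
import pandas as pd

# Illustrative flags only: None plays the role of the NaN left by the join.
playoff_flag = pd.Series(["Y", None, "Y", None, None], name="Playoff")

# Replace missing flags with "N", then count each label in one call.
playoff_flag = playoff_flag.fillna("N")
counts = playoff_flag.value_counts()
print(counts)
```

value_counts returns both labels at once, which makes it easy to confirm that the counts sum to the number of rows.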

We create a variable, col_to_drop, that lists all the columns to be dropped because they are not relevant to this data analysis.

In [19]:
col_to_drop = ['Rk', 'GS', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF']

The new dataframe, overall_nba_playoffs_stats2, is created by dropping the 13 irrelevant columns listed in col_to_drop from the table.

In [20]:
overall_nba_playoffs_stats2 = overall_nba_playoffs_stats.drop(columns=col_to_drop, inplace=False)
overall_nba_playoffs_stats2
Out[20]:
Player Pos Age Tm G MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% PTS Year Playoff
0 Quincy Acy PF 25 SAC 59 876 119 214 0.556 19 49 0.388 100 165 0.606 0.600 307 2015-2016 N
1 James Anderson SG 26 SAC 51 721 67 178 0.376 23 86 0.267 44 92 0.478 0.441 179 2015-2016 N
2 Marco Belinelli SG 29 SAC 68 1672 245 635 0.386 91 297 0.306 154 338 0.456 0.457 696 2015-2016 N
3 Caron Butler SF 35 SAC 17 176 25 59 0.424 3 18 0.167 22 41 0.537 0.449 63 2015-2016 N
4 Omri Casspi PF 27 SAC 69 1880 299 622 0.481 112 274 0.409 187 348 0.537 0.571 813 2015-2016 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3191 Kemba Walker PG 29 BOS 56 1742 378 889 0.425 180 473 0.381 198 416 0.476 0.526 1145 2019-2020 Y
3192 Brad Wanamaker PG 30 BOS 71 1369 162 362 0.448 37 102 0.363 125 260 0.481 0.499 487 2019-2020 Y
3193 Tremont Waters PG 22 BOS 11 119 14 49 0.286 4 24 0.167 10 25 0.400 0.327 40 2019-2020 Y
3194 Grant Williams PF 21 BOS 69 1043 87 211 0.412 24 96 0.250 63 115 0.548 0.469 237 2019-2020 Y
3195 Robert Williams C 22 BOS 29 388 64 88 0.727 0 0 NaN 64 88 0.727 0.727 150 2019-2020 Y

3196 rows × 19 columns

Next, we check the shape of the table to confirm that 19 of the original 32 columns remain for further analysis.

In [21]:
overall_nba_playoffs_stats2.shape
Out[21]:
(3196, 19)

We rename the columns so that each one is easier to understand and interpret. The list overall_nba_playoffs_stats2_cols holds the new names for the 19 remaining columns and is assigned to the dataframe's columns. The output below confirms that the columns are renamed in the main dataframe.

In [22]:
overall_nba_playoffs_stats2_cols = ['Player', 'Position','Age', 'Team','GamesPlayed', 'MinutesPlayed','FieldGoals', 'FieldGoalsAttempts','FieldGoals%', '3Pointers', '3Pointer_Attempts', '3Pointers%', '2Pointers', '2Pointer_Attempts', '2Pointers%', 'EffectiveFieldGoals%', 'TotalPoints', 'Year','Playoff'] 
overall_nba_playoffs_stats2.columns = overall_nba_playoffs_stats2_cols 
overall_nba_playoffs_stats2.columns
overall_nba_playoffs_stats2
Out[22]:
Index(['Player', 'Position', 'Age', 'Team', 'GamesPlayed', 'MinutesPlayed',
       'FieldGoals', 'FieldGoalsAttempts', 'FieldGoals%', '3Pointers',
       '3Pointer_Attempts', '3Pointers%', '2Pointers', '2Pointer_Attempts',
       '2Pointers%', 'EffectiveFieldGoals%', 'TotalPoints', 'Year', 'Playoff'],
      dtype='object')
Out[22]:
Player Position Age Team GamesPlayed MinutesPlayed FieldGoals FieldGoalsAttempts FieldGoals% 3Pointers 3Pointer_Attempts 3Pointers% 2Pointers 2Pointer_Attempts 2Pointers% EffectiveFieldGoals% TotalPoints Year Playoff
0 Quincy Acy PF 25 SAC 59 876 119 214 0.556 19 49 0.388 100 165 0.606 0.600 307 2015-2016 N
1 James Anderson SG 26 SAC 51 721 67 178 0.376 23 86 0.267 44 92 0.478 0.441 179 2015-2016 N
2 Marco Belinelli SG 29 SAC 68 1672 245 635 0.386 91 297 0.306 154 338 0.456 0.457 696 2015-2016 N
3 Caron Butler SF 35 SAC 17 176 25 59 0.424 3 18 0.167 22 41 0.537 0.449 63 2015-2016 N
4 Omri Casspi PF 27 SAC 69 1880 299 622 0.481 112 274 0.409 187 348 0.537 0.571 813 2015-2016 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3191 Kemba Walker PG 29 BOS 56 1742 378 889 0.425 180 473 0.381 198 416 0.476 0.526 1145 2019-2020 Y
3192 Brad Wanamaker PG 30 BOS 71 1369 162 362 0.448 37 102 0.363 125 260 0.481 0.499 487 2019-2020 Y
3193 Tremont Waters PG 22 BOS 11 119 14 49 0.286 4 24 0.167 10 25 0.400 0.327 40 2019-2020 Y
3194 Grant Williams PF 21 BOS 69 1043 87 211 0.412 24 96 0.250 63 115 0.548 0.469 237 2019-2020 Y
3195 Robert Williams C 22 BOS 29 388 64 88 0.727 0 0 NaN 64 88 0.727 0.727 150 2019-2020 Y

3196 rows × 19 columns
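Assigning a full list of names works as long as the list is in exactly the same order as the existing columns. A minimal sketch of an alternative, rename with an explicit old-to-new mapping, is shown on a toy frame (the frame raw and the partial mapping rename_map are hypothetical):

```python
import pandas as pd

# Toy frame using a few of the original short column names.
raw = pd.DataFrame({"Player": ["P1"], "Pos": ["SG"], "Tm": ["BOS"], "G": [10], "3P": [7]})

# Each old name is paired with its new name, so column order does not matter.
rename_map = {"Pos": "Position", "Tm": "Team", "G": "GamesPlayed", "3P": "3Pointers"}
renamed = raw.rename(columns=rename_map)

print(renamed.columns.tolist())
```

With a mapping, a column that is later added or moved cannot silently receive the wrong name.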

I have created 5 calculated columns, MinutesPerGame, FieldGoalsPerGame, 3PointerPerGame, 2PointerPerGame, and TotalPointsPerGame, to make the data easier to understand and interpret. They should also make the Exploratory Data Analysis on these columns easier for the viewer to follow.

The output below shows that the table now has 24 columns for further analysis.

In [23]:
overall_nba_playoffs_stats2['MinutesPerGame'] = round(overall_nba_playoffs_stats2['MinutesPlayed'] / overall_nba_playoffs_stats2['GamesPlayed'],2)
overall_nba_playoffs_stats2['FieldGoalsPerGame'] = round(overall_nba_playoffs_stats2['FieldGoals'] / overall_nba_playoffs_stats2['GamesPlayed'],0)
overall_nba_playoffs_stats2['3PointerPerGame'] = round(overall_nba_playoffs_stats2['3Pointers'] / overall_nba_playoffs_stats2['GamesPlayed'], 0)
overall_nba_playoffs_stats2['2PointerPerGame'] = round(overall_nba_playoffs_stats2['2Pointers'] / overall_nba_playoffs_stats2['GamesPlayed'], 0)
overall_nba_playoffs_stats2['TotalPointsPerGame'] = round(overall_nba_playoffs_stats2['TotalPoints'] / overall_nba_playoffs_stats2['GamesPlayed'], 0)

overall_nba_playoffs_stats2
Out[23]:
Player Position Age Team GamesPlayed MinutesPlayed FieldGoals FieldGoalsAttempts FieldGoals% 3Pointers ... 2Pointers% EffectiveFieldGoals% TotalPoints Year Playoff MinutesPerGame FieldGoalsPerGame 3PointerPerGame 2PointerPerGame TotalPointsPerGame
0 Quincy Acy PF 25 SAC 59 876 119 214 0.556 19 ... 0.606 0.600 307 2015-2016 N 14.85 2.0 0.0 2.0 5.0
1 James Anderson SG 26 SAC 51 721 67 178 0.376 23 ... 0.478 0.441 179 2015-2016 N 14.14 1.0 0.0 1.0 4.0
2 Marco Belinelli SG 29 SAC 68 1672 245 635 0.386 91 ... 0.456 0.457 696 2015-2016 N 24.59 4.0 1.0 2.0 10.0
3 Caron Butler SF 35 SAC 17 176 25 59 0.424 3 ... 0.537 0.449 63 2015-2016 N 10.35 1.0 0.0 1.0 4.0
4 Omri Casspi PF 27 SAC 69 1880 299 622 0.481 112 ... 0.537 0.571 813 2015-2016 N 27.25 4.0 2.0 3.0 12.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3191 Kemba Walker PG 29 BOS 56 1742 378 889 0.425 180 ... 0.476 0.526 1145 2019-2020 Y 31.11 7.0 3.0 4.0 20.0
3192 Brad Wanamaker PG 30 BOS 71 1369 162 362 0.448 37 ... 0.481 0.499 487 2019-2020 Y 19.28 2.0 1.0 2.0 7.0
3193 Tremont Waters PG 22 BOS 11 119 14 49 0.286 4 ... 0.400 0.327 40 2019-2020 Y 10.82 1.0 0.0 1.0 4.0
3194 Grant Williams PF 21 BOS 69 1043 87 211 0.412 24 ... 0.548 0.469 237 2019-2020 Y 15.12 1.0 0.0 1.0 3.0
3195 Robert Williams C 22 BOS 29 388 64 88 0.727 0 ... 0.727 0.727 150 2019-2020 Y 13.38 2.0 0.0 2.0 5.0

3196 rows × 24 columns
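Rounding the per-game columns to 0 decimals makes players with fairly different rates look alike. A small sketch using three rows from the table above (games played and 3-pointers made) compares rounding to 0 decimals with rounding to 1 decimal; the column names ending in _0dp and _1dp are hypothetical:

```python
import pandas as pd

# Games played and 3-pointers made, copied from the first rows shown above.
sample = pd.DataFrame({
    "Player": ["Quincy Acy", "Marco Belinelli", "Omri Casspi"],
    "GamesPlayed": [59, 68, 69],
    "3Pointers": [19, 91, 112],
})

per_game = sample["3Pointers"] / sample["GamesPlayed"]
sample["3PointerPerGame_0dp"] = per_game.round(0)  # rounding used in this notebook
sample["3PointerPerGame_1dp"] = per_game.round(1)  # keeps one decimal place

print(sample)
```

With 0 decimals the three players show 0.0, 1.0, and 2.0; with 1 decimal they show 0.3, 1.3, and 1.6, which keeps more of the difference between them.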

We have removed all irrelevant columns and renamed the remaining columns for easier interpretation. Now we need to identify the missing values in each column so the data can be analyzed properly. To do this we use 'overall_nba_playoffs_stats2.isnull().sum()', which counts the missing values in each column.

In [24]:
overall_nba_playoffs_stats2.shape 
overall_nba_playoffs_stats2.isnull().sum() #shows all missing values in each column
Out[24]:
(3196, 24)
Out[24]:
Player                    0
Position                  0
Age                       0
Team                      0
GamesPlayed               0
MinutesPlayed             0
FieldGoals                0
FieldGoalsAttempts        0
FieldGoals%              17
3Pointers                 0
3Pointer_Attempts         0
3Pointers%              249
2Pointers                 0
2Pointer_Attempts         0
2Pointers%               52
EffectiveFieldGoals%     17
TotalPoints               0
Year                      0
Playoff                   0
MinutesPerGame            0
FieldGoalsPerGame         0
3PointerPerGame           0
2PointerPerGame           0
TotalPointsPerGame        0
dtype: int64

To clean up the missing values we take a step-by-step approach rather than using the 'dropna' function on the entire dataframe. Depending on position, some NBA players may not attempt any 3-pointers and others may not attempt any 2-pointers, which leaves the corresponding percentage column missing. We will select the rows with missing values one column at a time and update the dataframe.

All rows with missing values in both the 'FieldGoals%' and 'EffectiveFieldGoals%' columns are removed from overall_nba_playoffs_stats2. These values are missing because the players attempted no field goals (and scored no points) that season, so the percentage is undefined. Their games played were very low, possibly because of injury.

In [25]:
missing_fieldgoals_percentage = overall_nba_playoffs_stats2[overall_nba_playoffs_stats2['FieldGoals%'].isnull()]
missing_fieldgoals_percentage
Out[25]:
Player Position Age Team GamesPlayed MinutesPlayed FieldGoals FieldGoalsAttempts FieldGoals% 3Pointers ... 2Pointers% EffectiveFieldGoals% TotalPoints Year Playoff MinutesPerGame FieldGoalsPerGame 3PointerPerGame 2PointerPerGame TotalPointsPerGame
38 Jarnell Stokes C 22 MEM 2 4 0 0 NaN 0 ... NaN NaN 0 2015-2016 Y 2.00 0.0 0.0 0.0 0.0
219 James Ennis SF 25 MIA 3 7 0 0 NaN 0 ... NaN NaN 0 2015-2016 Y 2.33 0.0 0.0 0.0 0.0
324 Sam Dekker PF 21 HOU 3 6 0 0 NaN 0 ... NaN NaN 0 2015-2016 Y 2.00 0.0 0.0 0.0 0.0
821 Andrew Bogut C 32 CLE 1 1 0 0 NaN 0 ... NaN NaN 0 2016-2017 Y 1.00 0.0 0.0 0.0 0.0
1076 Danuel House SG 23 WAS 1 1 0 0 NaN 0 ... NaN NaN 0 2016-2017 Y 1.00 0.0 0.0 0.0 0.0
1207 Rashad Vaughn SG 21 BRK 1 4 0 0 NaN 0 ... NaN NaN 0 2017-2018 N 4.00 0.0 0.0 0.0 0.0
1396 Trey McKinney-Jones SG 27 IND 1 1 0 0 NaN 0 ... NaN NaN 0 2017-2018 Y 1.00 0.0 0.0 0.0 0.0
1397 Ben Moore PF 22 IND 2 9 0 0 NaN 0 ... NaN NaN 0 2017-2018 Y 4.50 0.0 0.0 0.0 0.0
1482 Tyler Lydon PF 21 DEN 1 2 0 0 NaN 0 ... NaN NaN 0 2017-2018 N 2.00 0.0 0.0 0.0 0.0
1871 George King SF 25 PHO 1 6 0 0 NaN 0 ... NaN NaN 0 2018-2019 N 6.00 0.0 0.0 0.0 0.0
1873 Eric Moreland PF 27 PHO 1 5 0 0 NaN 0 ... NaN NaN 0 2018-2019 N 5.00 0.0 0.0 0.0 0.0
1929 John Holland SF 30 CLE 1 1 0 0 NaN 0 ... NaN NaN 0 2018-2019 N 1.00 0.0 0.0 0.0 0.0
1941 Kobi Simmons PG 21 CLE 1 2 0 0 NaN 0 ... NaN NaN 0 2018-2019 N 2.00 0.0 0.0 0.0 0.0
2001 Tyler Ulis PG 23 CHI 1 1 0 0 NaN 0 ... NaN NaN 0 2018-2019 N 1.00 0.0 0.0 0.0 0.0
2255 Ray Spalding PF 21 DAL 1 1 0 0 NaN 0 ... NaN NaN 0 2018-2019 N 1.00 0.0 0.0 0.0 0.0
3136 Marques Bolden C 21 CLE 1 3 0 0 NaN 0 ... NaN NaN 0 2019-2020 N 3.00 0.0 0.0 0.0 0.0
3146 J.P. Macura SG 24 CLE 1 1 0 0 NaN 0 ... NaN NaN 0 2019-2020 N 1.00 0.0 0.0 0.0 0.0

17 rows × 24 columns

In [26]:
overall_nba_playoffs_stats2 = overall_nba_playoffs_stats2.drop(missing_fieldgoals_percentage.index)
overall_nba_playoffs_stats2.isnull().sum() #shows all missing values in each column
Out[26]:
Player                    0
Position                  0
Age                       0
Team                      0
GamesPlayed               0
MinutesPlayed             0
FieldGoals                0
FieldGoalsAttempts        0
FieldGoals%               0
3Pointers                 0
3Pointer_Attempts         0
3Pointers%              232
2Pointers                 0
2Pointer_Attempts         0
2Pointers%               35
EffectiveFieldGoals%      0
TotalPoints               0
Year                      0
Playoff                   0
MinutesPerGame            0
FieldGoalsPerGame         0
3PointerPerGame           0
2PointerPerGame           0
TotalPointsPerGame        0
dtype: int64
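The remaining 232 and 35 missing values are in percentage columns, and a percentage is presumably left missing when there were no attempts to divide by. A minimal sketch on toy numbers (the frame shots is hypothetical) shows how a missing percentage lines up with zero attempts:

```python
import numpy as np
import pandas as pd

# Toy shooting data: no attempts, attempts with no makes, and attempts with makes.
shots = pd.DataFrame({
    "3Pointers": [0, 0, 4],
    "3Pointer_Attempts": [0, 5, 24],
})

# Treat zero attempts as missing before dividing, so 0/0 stays NaN.
attempts = shots["3Pointer_Attempts"].replace(0, np.nan)
shots["3Pointers%"] = (shots["3Pointers"] / attempts).round(3)

# The percentage is missing exactly where there were no attempts.
no_attempts = shots["3Pointer_Attempts"] == 0
print(shots)
print((shots["3Pointers%"].isna() == no_attempts).all())
```

The same comparison on the real dataframe would confirm whether every missing percentage comes from a row with zero attempts.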

All rows with missing values in the '3Pointers%' and '2Pointers%' columns are checked in the overall_nba_playoffs_stats2 dataframe. As mentioned earlier, depending on position some NBA players may not attempt any 3-pointers and others may not attempt any 2-pointers, which leaves these percentage columns missing. These rows are not dropped. Instead, we replace the missing values with 0 using 'overall_nba_playoffs_stats2.fillna(0, inplace=True)'.

In [27]:
overall_nba_playoffs_stats2.shape #used to confirm rows dropped 
overall_nba_playoffs_stats2.isnull().sum() #used to reveal remaining rows with missing values
overall_nba_playoffs_stats2.fillna(0, inplace=True)
overall_nba_playoffs_stats2.isnull().sum() #used to reveal remaining rows with missing values
overall_nba_playoffs_stats2.shape
overall_nba_playoffs_stats2.dtypes
Out[27]:
(3179, 24)
Out[27]:
Player                    0
Position                  0
Age                       0
Team                      0
GamesPlayed               0
MinutesPlayed             0
FieldGoals                0
FieldGoalsAttempts        0
FieldGoals%               0
3Pointers                 0
3Pointer_Attempts         0
3Pointers%              232
2Pointers                 0
2Pointer_Attempts         0
2Pointers%               35
EffectiveFieldGoals%      0
TotalPoints               0
Year                      0
Playoff                   0
MinutesPerGame            0
FieldGoalsPerGame         0
3PointerPerGame           0
2PointerPerGame           0
TotalPointsPerGame        0
dtype: int64
Out[27]:
Player                  0
Position                0
Age                     0
Team                    0
GamesPlayed             0
MinutesPlayed           0
FieldGoals              0
FieldGoalsAttempts      0
FieldGoals%             0
3Pointers               0
3Pointer_Attempts       0
3Pointers%              0
2Pointers               0
2Pointer_Attempts       0
2Pointers%              0
EffectiveFieldGoals%    0
TotalPoints             0
Year                    0
Playoff                 0
MinutesPerGame          0
FieldGoalsPerGame       0
3PointerPerGame         0
2PointerPerGame         0
TotalPointsPerGame      0
dtype: int64
Out[27]:
(3179, 24)
Out[27]:
Player                   object
Position                 object
Age                       int64
Team                     object
GamesPlayed               int64
MinutesPlayed             int64
FieldGoals                int64
FieldGoalsAttempts        int64
FieldGoals%             float64
3Pointers                 int64
3Pointer_Attempts         int64
3Pointers%              float64
2Pointers                 int64
2Pointer_Attempts         int64
2Pointers%              float64
EffectiveFieldGoals%    float64
TotalPoints               int64
Year                     object
Playoff                  object
MinutesPerGame          float64
FieldGoalsPerGame       float64
3PointerPerGame         float64
2PointerPerGame         float64
TotalPointsPerGame      float64
dtype: object
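One side effect of filling with 0 is that a player with no attempts looks the same as a player who missed every attempt. A minimal sketch, on toy data, of recording that difference in a flag column before filling (the column name Has3PAttempts is hypothetical and not part of this notebook):

```python
import numpy as np
import pandas as pd

# Toy rows: one player with no attempts, one who missed all, one who made some.
stats = pd.DataFrame({
    "Player": ["no attempts", "missed all", "made some"],
    "3Pointer_Attempts": [0, 3, 10],
    "3Pointers%": [np.nan, 0.0, 0.4],
})

# Record whether the player attempted any 3-pointers before filling.
stats["Has3PAttempts"] = stats["3Pointer_Attempts"] > 0

# Fill only the named column instead of every column in the frame.
stats = stats.fillna({"3Pointers%": 0})
print(stats)
```

Passing a dictionary to fillna limits the fill to the listed columns, so a missing value elsewhere in the table would not be turned into 0 by accident.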

A boxplot is created for 'GamesPlayed' and 'Age' to reveal any significant outliers.

In [28]:
overall_nba_playoffs_stats2.boxplot(column=['GamesPlayed', 'Age'])
overall_nba_playoffs_stats2[['GamesPlayed', 'Age']].describe(percentiles = [.25, .5, .75, .95])
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x122e2be80>
Out[28]:
GamesPlayed Age
count 3179.000000 3179.000000
mean 44.413023 26.262347
std 26.067750 4.184737
min 1.000000 19.000000
25% 20.000000 23.000000
50% 48.000000 26.000000
75% 68.000000 29.000000
95% 81.000000 34.000000
max 82.000000 43.000000

Explaining Box Plot in GamesPlayed Column: The box plot for GamesPlayed displays the five-number summary. The lower whisker is the minimum, 1 game played by an NBA player in a season. The upper whisker is the maximum, 82 games played in a season. The lower quartile shows that 25% of NBA players played 20 games or fewer, and the upper quartile shows that 75% played 68 games or fewer. The line inside the box is the median, 48 games (the mean is about 44). There are no significant outliers in the GamesPlayed box plot.

Explaining Box Plot in Age Column: The box plot for Age displays the five-number summary. The lower whisker is the youngest NBA player in our data, age 19. The lower quartile shows that 25% of NBA players are age 23 or younger, and the upper quartile shows that 75% are age 29 or younger. The line inside the box is the median age, 26. Outliers are plotted as individual dots in line with the whiskers, above the upper whisker, up to the maximum age of 43. We treat age 39 as the upper extreme, so players older than 39 are outliers and will be removed from our data.

The shape of the dataframe shows 3179 rows before dropping NBA players older than 39.
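By default, a matplotlib box plot draws each whisker out to the farthest data point that lies within 1.5 times the interquartile range (IQR) of the box, and plots anything beyond that as an outlier. A minimal sketch of that rule, using the quartiles from the describe() output above (the helper name iqr_fences is hypothetical):

```python
def iqr_fences(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Quartiles taken from the describe() output above.
games_lower, games_upper = iqr_fences(20, 68)  # GamesPlayed
age_lower, age_upper = iqr_fences(23, 29)      # Age

print(games_lower, games_upper)
print(age_lower, age_upper)
```

For GamesPlayed the limits are -52 and 140, so every value from 1 to 82 falls inside them, which matches the absence of outliers. For Age the limits are 14 and 38, so by this rule the cutoff of 39 used in this notebook is slightly looser than the 1.5 times IQR limit.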

In [29]:
print("Removing Outliers from Age Column")
overall_nba_playoffs_stats2.shape


ageover39 = overall_nba_playoffs_stats2[overall_nba_playoffs_stats2.Age > 39].index

overall_nba_playoffs_stats2.drop(ageover39, inplace=True) #function that drops all players with age above 39

overall_nba_playoffs_stats2.shape 
Removing Outliers from Age Column
Out[29]:
(3179, 24)
Out[29]:
(3172, 24)

The shape of the dataframe shows 3172 rows after dropping NBA players older than 39.
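The drop above removed 3179 - 3172 = 7 rows. As a small sketch on toy data (the frame players and the name max_age are hypothetical), the same filter can be written as a boolean mask that keeps the rows we want, which avoids modifying the dataframe in place:

```python
import pandas as pd

# Toy ages only; 39 is the cutoff used in this notebook.
players = pd.DataFrame({"Player": ["P1", "P2", "P3"], "Age": [25, 43, 39]})
max_age = 39

# Keep players at or below the cutoff and renumber the rows.
kept = players[players["Age"] <= max_age].reset_index(drop=True)
removed_count = len(players) - len(kept)

print(kept)
print(removed_count)
```

Keeping the original frame untouched makes it easy to report how many rows were removed and to rerun the cell without side effects.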

In [30]:
overall_nba_playoffs_stats2['MinutesPerGame'].describe(percentiles = [.25, .5, .75, .95])
overall_nba_playoffs_stats2.boxplot(column=['MinutesPerGame'])
Out[30]:
count    3172.000000
mean       19.281917
std         9.042911
min         0.670000
25%        12.180000
50%        18.865000
75%        26.702500
95%        33.894500
max        42.000000
Name: MinutesPerGame, dtype: float64
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x12311ef40>

Explaining Box Plot in MinutesPerGame Column: The box plot for MinutesPerGame displays the five-number summary. The lower whisker is the minimum, 0.67 minutes per game played by an NBA player. The upper whisker is the maximum, 42 minutes per game. The lower quartile shows that 25% of NBA players played 12.18 minutes per game or less, and the upper quartile shows that 75% played about 26.70 minutes per game or less. The line inside the box is the median, about 18.87 minutes per game (the mean is about 19.28). These values are decimal minutes, not minutes and seconds. There are no significant outliers in the MinutesPerGame box plot.

I was curious to see which NBA player had the lowest minutes per game in the data. I queried that row below because such a low minutes-per-game figure did not seem realistic for an NBA season.
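The describe() output above reports MinutesPerGame in decimal minutes, so the digits after the decimal point are a fraction of a minute rather than seconds. A minimal sketch of the conversion (the helper name to_min_sec is hypothetical):

```python
def to_min_sec(decimal_minutes):
    """Convert decimal minutes to a (minutes, seconds) pair, rounded to the second."""
    total_seconds = round(decimal_minutes * 60)
    return divmod(total_seconds, 60)

# Minimum, quartiles, and maximum from the describe() output above.
for value in (0.67, 12.18, 18.865, 26.7025, 42.0):
    minutes, seconds = to_min_sec(value)
    print(f"{value} min = {minutes} min {seconds} s")
```

For example, 12.18 minutes is about 12 minutes 11 seconds, and 26.7025 minutes is about 26 minutes 42 seconds.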

In [31]:
overall_nba_playoffs_stats2.shape 
minutespergame = overall_nba_playoffs_stats2[overall_nba_playoffs_stats2.MinutesPerGame <= 0.67]
minutespergame
Out[31]:
(3172, 24)
Out[31]:
Player Position Age Team GamesPlayed MinutesPlayed FieldGoals FieldGoalsAttempts FieldGoals% 3Pointers ... 2Pointers% EffectiveFieldGoals% TotalPoints Year Playoff MinutesPerGame FieldGoalsPerGame 3PointerPerGame 2PointerPerGame TotalPointsPerGame
1847 Donte Grantham SF 23 OKC 3 2 0 2 0.0 0 ... 0.0 0.0 0 2018-2019 Y 0.67 0.0 0.0 0.0 0.0

1 rows × 24 columns
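The row above shows 2 minutes played over 3 games, which gives 0.67 minutes per game. The query needed the minimum value typed in by hand. As a small sketch on toy data (the frame minutes is hypothetical), idxmin and nsmallest find the lowest rows without hard-coding the value:

```python
import pandas as pd

# Toy per-game minutes for three made-up players.
minutes = pd.DataFrame({
    "Player": ["P1", "P2", "P3"],
    "MinutesPerGame": [14.85, 0.67, 31.11],
})

# idxmin returns the row label of the smallest value.
lowest = minutes.loc[minutes["MinutesPerGame"].idxmin()]

# nsmallest returns the n rows with the smallest values, in ascending order.
bottom_two = minutes.nsmallest(2, "MinutesPerGame")

print(lowest["Player"])
print(bottom_two)
```

nsmallest is handy when several players are close to the minimum and it helps to see them side by side.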

In [32]:
overall_nba_playoffs_stats2['FieldGoalsPerGame'].describe(percentiles = [.25, .5, .75, .95])
overall_nba_playoffs_stats2.boxplot(column=['FieldGoalsPerGame'])
Out[32]:
count    3172.000000
mean        3.027743
std         2.123702
min         0.000000
25%         1.000000
50%         3.000000
75%         4.000000
95%         7.000000
max        11.000000
Name: FieldGoalsPerGame, dtype: float64
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x123214b50>

Explaining Box Plot in FieldGoalsPerGame Column: The box plot for FieldGoalsPerGame displays the five-number summary. The lower whisker is the minimum, 0 field goals per game. This column counts both 2-pointers and 3-pointers and is rounded to a whole number, so players who average well under one field goal per game show 0. The upper whisker ends at 8 field goals per game, and the points above it, up to the maximum of 11, are plotted as outliers. The lower quartile shows that 25% of NBA players make 1 field goal per game or fewer, and the upper quartile shows that 75% make 4 field goals per game or fewer. The line inside the box is the median, 3 field goals per game. Even though the FieldGoalsPerGame box plot shows outliers, I will keep them because field goals depend heavily on an NBA player's position and these values are needed for the data analysis of my hypothesis.

The shape of the dataframe is still 3172 rows after analyzing the FieldGoalsPerGame box plot.

In [33]:
overall_nba_playoffs_stats2.shape
overall_nba_playoffs_stats2.boxplot(column=['3PointerPerGame', '2PointerPerGame', 'TotalPointsPerGame'])
overall_nba_playoffs_stats2[['3PointerPerGame', '2PointerPerGame', 'TotalPointsPerGame']].describe(percentiles = [.25, .5, .75, .95])
Out[33]:
(3172, 24)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1232d13a0>
Out[33]:
3PointerPerGame 2PointerPerGame TotalPointsPerGame
count 3172.000000 3172.000000 3172.000000
mean 0.775851 2.216898 8.228562
std 0.844152 1.765750 5.840660
min 0.000000 0.000000 0.000000
25% 0.000000 1.000000 4.000000
50% 1.000000 2.000000 7.000000
75% 1.000000 3.000000 11.000000
95% 2.000000 6.000000 20.000000
max 5.000000 10.000000 36.000000

Explaining Box Plot in 3PointerPerGame Column: The box plot for 3PointerPerGame displays the five-number summary. The lower whisker is the minimum, 0 three-pointers per game, because depending on position some NBA players rarely or never shoot 3-pointers. The maximum is 5 three-pointers per game, which lies above the upper whisker. The lower quartile is 0, which means at least 25% of NBA players average 0 three-pointers per game after rounding; for instance, players at positions such as Center or Power Forward often do not shoot 3-pointers in a game. The upper quartile shows that 75% of NBA players make 1 three-pointer per game or fewer. The median is 1 three-pointer per game in a regular season (the mean is about 0.78).

Explaining Box Plot in 2PointerPerGame Column: The box plot for 2PointerPerGame displays the five-number summary. The lower whisker is the minimum, 0 two-pointers per game after rounding, because depending on position some NBA players mostly shoot 3-pointers. The maximum is 10 two-pointers per game, which lies above the upper whisker. The lower quartile shows that 25% of NBA players make 1 two-pointer per game or fewer, and the upper quartile shows that 75% make 3 two-pointers per game or fewer. The median is 2 two-pointers per game in a regular season (the mean is about 2.22).

Explaining Box Plot in TotalPointsPerGame Column: The box plot for TotalPointsPerGame displays the five-number summary. The lower whisker is the minimum, 0 total points per game after rounding, which can happen when a player takes very few shots or is injured. The maximum is 36 total points per game, which lies above the upper whisker. The lower quartile shows that 25% of NBA players score 4 total points per game or fewer, and the upper quartile shows that 75% score 11 total points per game or fewer. The median is 7 total points per game in a regular season (the mean is about 8.23).

Keep in mind that this is historical data from the 2015-2020 NBA seasons, and the values in the columns above can fluctuate from season to season. Even though there are outliers in these columns, we will keep them because the columns are central to testing the hypothesis that shooting 3-pointers, rather than 2-pointers, is associated with making the NBA playoffs.

We reorder the columns in the table so that it is better organized and easier for the reviewer to understand.

Below, I show the column order before reordering, where all the newly added columns are at the end of the table.

In [34]:
overall_nba_playoffs_stats2.columns
Out[34]:
Index(['Player', 'Position', 'Age', 'Team', 'GamesPlayed', 'MinutesPlayed',
       'FieldGoals', 'FieldGoalsAttempts', 'FieldGoals%', '3Pointers',
       '3Pointer_Attempts', '3Pointers%', '2Pointers', '2Pointer_Attempts',
       '2Pointers%', 'EffectiveFieldGoals%', 'TotalPoints', 'Year', 'Playoff',
       'MinutesPerGame', 'FieldGoalsPerGame', '3PointerPerGame',
       '2PointerPerGame', 'TotalPointsPerGame'],
      dtype='object')

In this step, we move each newly added column, which was appended at the end of the table earlier, next to the column it was calculated from.

In [35]:
overall_nba_playoffs_stats2 = overall_nba_playoffs_stats2[['Player', 'Position', 'Age', 'Team', 'GamesPlayed', 'MinutesPlayed', 'MinutesPerGame', 'FieldGoals', 
                            'FieldGoalsPerGame', 'FieldGoalsAttempts', 'FieldGoals%', '3Pointers', '3PointerPerGame', '3Pointer_Attempts', 
                             '3Pointers%','2Pointers', '2PointerPerGame', '2Pointer_Attempts','2Pointers%', 'EffectiveFieldGoals%', 
                             'TotalPoints', 'TotalPointsPerGame', 'Year','Playoff']]
In [36]:
overall_nba_playoffs_stats2.columns
Out[36]:
Index(['Player', 'Position', 'Age', 'Team', 'GamesPlayed', 'MinutesPlayed',
       'MinutesPerGame', 'FieldGoals', 'FieldGoalsPerGame',
       'FieldGoalsAttempts', 'FieldGoals%', '3Pointers', '3PointerPerGame',
       '3Pointer_Attempts', '3Pointers%', '2Pointers', '2PointerPerGame',
       '2Pointer_Attempts', '2Pointers%', 'EffectiveFieldGoals%',
       'TotalPoints', 'TotalPointsPerGame', 'Year', 'Playoff'],
      dtype='object')
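As an alternative to typing the full column list by hand, the reordering above could be sketched with a small helper. The function name `move_after` and the shortened column list are hypothetical and only for illustration; the idea is to place each calculated column right after the column it was derived from and to confirm that no column was lost.

```python
import pandas as pd

def move_after(columns, column, anchor):
    """Return a new column list with `column` placed right after `anchor`."""
    reordered = [c for c in columns if c != column]
    reordered.insert(reordered.index(anchor) + 1, column)
    return reordered

# Shortened, hypothetical version of the table: new columns sit at the end.
df_demo = pd.DataFrame(columns=['Player', 'MinutesPlayed', 'FieldGoals', 'TotalPoints', 'Year',
                                'MinutesPerGame', 'FieldGoalsPerGame', 'TotalPointsPerGame'])

original = list(df_demo.columns)
order = original
for new_col, anchor in [('MinutesPerGame', 'MinutesPlayed'),
                        ('FieldGoalsPerGame', 'FieldGoals'),
                        ('TotalPointsPerGame', 'TotalPoints')]:
    order = move_after(order, new_col, anchor)

# Reordering must not add or drop any column.
assert sorted(order) == sorted(original)
df_demo = df_demo[order]
print(list(df_demo.columns))
```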

In the final step, we check that the dataframe has the expected number of rows and columns after the cleanup process, and make a last check that there are no null values in any column. After joining with the reference data there were 3196 rows and 32 columns. We then eliminated 13 irrelevant columns and added 5 calculated columns to support better analysis and interpretation of the data, and a total of 24 rows were eliminated. After the phase one cleanup, 3172 rows and 24 columns will be used for data visualization in the next phase.
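The bookkeeping in the paragraph above can be written out as a short check. This is only a sketch: the helper `cleanup_is_consistent` is a hypothetical name, and the counts are the ones stated in the text.

```python
# Counts stated in the cleanup summary
rows_start, cols_start = 3196, 32
cols_dropped, cols_added = 13, 5
rows_dropped = 24

expected_shape = (rows_start - rows_dropped, cols_start - cols_dropped + cols_added)
print(expected_shape)

def cleanup_is_consistent(shape, null_count, expected):
    """True when the frame has the expected shape and no remaining nulls."""
    return shape == expected and null_count == 0

# In the notebook this could be called as (not executed here):
# cleanup_is_consistent(overall_nba_playoffs_stats2.shape,
#                       int(overall_nba_playoffs_stats2.isnull().sum().sum()),
#                       expected_shape)
```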

In [37]:
overall_nba_playoffs_stats2.shape
overall_nba_playoffs_stats2.isnull().sum()
overall_nba_playoffs_stats2.info()
Out[37]:
(3172, 24)
Out[37]:
Player                  0
Position                0
Age                     0
Team                    0
GamesPlayed             0
MinutesPlayed           0
MinutesPerGame          0
FieldGoals              0
FieldGoalsPerGame       0
FieldGoalsAttempts      0
FieldGoals%             0
3Pointers               0
3PointerPerGame         0
3Pointer_Attempts       0
3Pointers%              0
2Pointers               0
2PointerPerGame         0
2Pointer_Attempts       0
2Pointers%              0
EffectiveFieldGoals%    0
TotalPoints             0
TotalPointsPerGame      0
Year                    0
Playoff                 0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3172 entries, 0 to 3195
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Player                3172 non-null   object 
 1   Position              3172 non-null   object 
 2   Age                   3172 non-null   int64  
 3   Team                  3172 non-null   object 
 4   GamesPlayed           3172 non-null   int64  
 5   MinutesPlayed         3172 non-null   int64  
 6   MinutesPerGame        3172 non-null   float64
 7   FieldGoals            3172 non-null   int64  
 8   FieldGoalsPerGame     3172 non-null   float64
 9   FieldGoalsAttempts    3172 non-null   int64  
 10  FieldGoals%           3172 non-null   float64
 11  3Pointers             3172 non-null   int64  
 12  3PointerPerGame       3172 non-null   float64
 13  3Pointer_Attempts     3172 non-null   int64  
 14  3Pointers%            3172 non-null   float64
 15  2Pointers             3172 non-null   int64  
 16  2PointerPerGame       3172 non-null   float64
 17  2Pointer_Attempts     3172 non-null   int64  
 18  2Pointers%            3172 non-null   float64
 19  EffectiveFieldGoals%  3172 non-null   float64
 20  TotalPoints           3172 non-null   int64  
 21  TotalPointsPerGame    3172 non-null   float64
 22  Year                  3172 non-null   object 
 23  Playoff               3172 non-null   object 
dtypes: float64(9), int64(10), object(5)
memory usage: 619.5+ KB
In [38]:
overall_nba_playoffs_stats2.to_csv('nba_stats_final_phase1.csv', header = True, mode = 'w', index=False)
STOP HERE for your EDA Phase 1 assignment. Submit your cleaned data file along with the completed notebook up to this point for grading.

EDA Phase 2

All of your work for the EDA Phase 2 assignment will begin below here. Refer to the detailed instructions and expectations for this assignment in Canvas.
In [39]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px
%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [40]:
df = pd.read_csv('nba_stats_final_phase1.csv')
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3172 entries, 0 to 3171
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Player                3172 non-null   object 
 1   Position              3172 non-null   object 
 2   Age                   3172 non-null   int64  
 3   Team                  3172 non-null   object 
 4   GamesPlayed           3172 non-null   int64  
 5   MinutesPlayed         3172 non-null   int64  
 6   MinutesPerGame        3172 non-null   float64
 7   FieldGoals            3172 non-null   int64  
 8   FieldGoalsPerGame     3172 non-null   float64
 9   FieldGoalsAttempts    3172 non-null   int64  
 10  FieldGoals%           3172 non-null   float64
 11  3Pointers             3172 non-null   int64  
 12  3PointerPerGame       3172 non-null   float64
 13  3Pointer_Attempts     3172 non-null   int64  
 14  3Pointers%            3172 non-null   float64
 15  2Pointers             3172 non-null   int64  
 16  2PointerPerGame       3172 non-null   float64
 17  2Pointer_Attempts     3172 non-null   int64  
 18  2Pointers%            3172 non-null   float64
 19  EffectiveFieldGoals%  3172 non-null   float64
 20  TotalPoints           3172 non-null   int64  
 21  TotalPointsPerGame    3172 non-null   float64
 22  Year                  3172 non-null   object 
 23  Playoff               3172 non-null   object 
dtypes: float64(9), int64(10), object(5)
memory usage: 594.9+ KB
Out[40]:
Player Position Age Team GamesPlayed MinutesPlayed MinutesPerGame FieldGoals FieldGoalsPerGame FieldGoalsAttempts ... 3Pointers% 2Pointers 2PointerPerGame 2Pointer_Attempts 2Pointers% EffectiveFieldGoals% TotalPoints TotalPointsPerGame Year Playoff
0 Quincy Acy PF 25 SAC 59 876 14.85 119 2.0 214 ... 0.388 100 2.0 165 0.606 0.600 307 5.0 2015-2016 N
1 James Anderson SG 26 SAC 51 721 14.14 67 1.0 178 ... 0.267 44 1.0 92 0.478 0.441 179 4.0 2015-2016 N
2 Marco Belinelli SG 29 SAC 68 1672 24.59 245 4.0 635 ... 0.306 154 2.0 338 0.456 0.457 696 10.0 2015-2016 N
3 Caron Butler SF 35 SAC 17 176 10.35 25 1.0 59 ... 0.167 22 1.0 41 0.537 0.449 63 4.0 2015-2016 N
4 Omri Casspi PF 27 SAC 69 1880 27.25 299 4.0 622 ... 0.409 187 3.0 348 0.537 0.571 813 12.0 2015-2016 N

5 rows × 24 columns

NBA Statistics Heatmap Analysis

In [41]:
columns = ['TotalPointsPerGame','FieldGoalsPerGame', '3PointerPerGame', '2PointerPerGame', 'MinutesPerGame']
df_corr = df[columns]
# setting up the heatmap
corrmat = df_corr.corr()

# set the figure size
f, ax = plt.subplots(figsize=(9, 6))
# create a heat map of the five per-game variables selected above
# colormap reference: https://matplotlib.org/3.1.1/gallery/color/colormap_reference.html
sns.heatmap(corrmat, vmax=.8, square=True, annot=True, cmap='RdYlGn', linewidths=.5 )
plt.title('Heatmap NBA Statistics Analysis')
# save the heat map to disk (the visual cues are explained in the markdown below)
plt.savefig('Correlation Heat Map NBA Statistics')
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x125f5fd60>
Out[41]:
Text(0.5, 1.0, 'Heatmap NBA Statistics Analysis')

Explanation of how the visual cues of the heatmap represent the correlations.

TotalPointsPerGame Correlations Analysis:

Based on the correlation matrix, the dark green color shows a high positive correlation between total points per game and three other columns: field goals, 2 pointers, and minutes played per game. This relationship makes sense because a player's total points per game depends on the minutes played and on the field goals made in the 2 pointer and 3 pointer shooting categories, which are also the numbers considered when the Most Valuable Player of the game is awarded. Similarly, the light green color shows a slight positive correlation with 3 pointers per game, since the number of 3 pointers scored by a player may be lower depending on the player's position.

FieldGoalsPerGame Correlations Analysis:

In the field goals per game row of the correlation matrix, dark green again shows a high positive correlation between field goals per game and three other columns: total points, 2 pointers, and minutes per game. As with total points, field goals depend on how many 2 pointers a player attempted and made, which are tallied into total points, and the minutes played also matter for this analysis. The yellow color shows a moderate, close to neutral correlation between field goals and 3 pointers per game. Because field goals include both the 2 pointer and 3 pointer categories, there are two possible explanations: either the player attempted more 2 pointers per game, or the player missed a lot of 3 pointers per game during the season.

3PointerPerGame Correlations with TotalPointsPerGame, FieldGoalsPerGame and MinutesPerGame Analysis:

Three pointers per game shows a high positive relationship (dark green) with total points and minutes per game, because it depends on how many minutes the player has played and because three pointers are tallied into total points per game. On the other hand, three pointers per game shows a moderate, close to neutral correlation with field goals per game, because players may not be scoring in proportion to the number of field goals they attempt per game.

2PointerPerGame Correlations with TotalPointsPerGame, FieldGoalsPerGame and MinutesPerGame Analysis:

Two pointers per game shows a high positive relationship (dark green) with total points and field goals per game, because players tend to score in proportion to the number of field goals they attempt per game, and because two pointers are tallied into total points per game. Minutes per game shows a medium positive relationship, because the number of minutes played can affect a player's shooting in either a positive or a negative way.

3PointerPerGame Correlations with 2PointerPerGame Analysis:

Three pointers and two pointers per game appear in red, the low end of the color scale, which we read as a negative correlation. They are two different shooting categories and do not depend on each other in terms of contribution towards a player's overall statistics. The coefficient annotated in the cell gives the exact strength and sign.
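Because colors can be hard to compare by eye, the same correlations can also be read as a ranked list of column pairs. The sketch below uses randomly generated toy data (not the NBA data) and assumes `corrmat` is a square correlation matrix like the one computed in the heatmap cell; it keeps each pair once and sorts from the strongest positive to the lowest correlation.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the per-game columns (values are made up).
rng = np.random.default_rng(0)
n = 200
minutes = rng.uniform(5, 36, size=n)
two_pt = 0.12 * minutes + rng.normal(0, 0.5, size=n)
three_pt = 0.04 * minutes + rng.normal(0, 0.5, size=n)
toy = pd.DataFrame({
    "MinutesPerGame": minutes,
    "2PointerPerGame": two_pt,
    "3PointerPerGame": three_pt,
    "TotalPointsPerGame": 2 * two_pt + 3 * three_pt,
})

corrmat = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = np.triu(np.ones(corrmat.shape, dtype=bool), k=1)
pairs = corrmat.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs.round(2))
```

Reading the sign and size of each coefficient from this list is a useful cross-check on the color-based interpretation, since the color scale only shows relative position between the lowest and highest values.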

Seaborn Box Plot Analysis

In [42]:
# figsize has to be passed to plt.subplots; assigning plt.figsize has no effect
f, axes = plt.subplots(1, 4, figsize=(10, 20))
# passing the column as y draws a vertical box plot
sns.boxplot(y=df['GamesPlayed'], color = 'forestgreen', ax = axes[0])
sns.boxplot(y=df['Age'], color = 'darkorange', ax = axes[1])
sns.boxplot(y=df['MinutesPerGame'], color = 'lightgray', ax = axes[2])
sns.boxplot(y=df['FieldGoalsPerGame'], color = 'dimgrey', ax = axes[3])
plt.tight_layout()


f, axes = plt.subplots(1, 3, figsize=(10, 20))
sns.boxplot(y=df['2PointerPerGame'], color = 'darkviolet', ax = axes[0])
sns.boxplot(y=df['3PointerPerGame'], color = 'darksalmon', ax = axes[1])
sns.boxplot(y=df['TotalPointsPerGame'], color = 'lavender', ax = axes[2])
plt.tight_layout()
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x126427250>
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x126460cd0>
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x126494130>
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1264bf550>
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x126575f40>
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1265a3550>
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1265cc970>

Explain Outliers: GamesPlayed, Age, MinutesPerGame and FieldGoalsPerGame

There are no significant outliers in the GamesPlayed box plot. The outliers in the Age box plot are accurate: the upper extreme for an NBA player's age is 38, and the points beyond it are real players, since the data includes several players who are still in the NBA at around age 40 and are on teams that make the playoffs. There are no significant outliers in the MinutesPerGame box plot. The FieldGoalsPerGame box plot also shows accurate outliers, because every season produces a group of players who achieve superior offensive statistics, scoring in the range of 9 to 11 field goals per game. Note that FieldGoalsPerGame includes both three pointer and two pointer shots, and depending on position a player may shoot mostly 2 pointers or mostly 3 pointers.

Explain Outliers: 2PointerPerGame, 3PointerPerGame, and TotalPointsPerGame

The outliers for all of the following columns are accurate as represented by the box plots. The outliers for 2PointerPerGame lie beyond the upper extreme, at around 7 to 10 two pointers per game. Similarly, the 3PointerPerGame box plot indicates that players shooting in the range of 3 to 5 three pointers per game are outliers. Finally, the TotalPointsPerGame box plot looks reasonably accurate based on our data: some players may be injured or may play fewer minutes per game than others, which skews the distribution, so players scoring more than 22 points per game fall beyond the upper extreme and appear as outliers.

Although the overwhelming majority of players post modest offensive statistics in these shooting categories, the outliers play a vital role in our analysis because they are the NBA players leading in average two pointers, three pointers, and total points per game. Analyzing these outliers will help us understand and answer our hypothesis.
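To see who the outliers are rather than only where they sit on the plot, the points beyond the whiskers can be listed by name. This is a minimal sketch on made-up players, assuming the default 1.5 times IQR rule; the helper name `iqr_outlier_mask` is hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up players and values, for illustration only.
toy = pd.DataFrame({
    "Player": [f"Player {c}" for c in "ABCDEFGHIJ"],
    "2PointerPerGame": [1, 1, 2, 2, 3, 3, 3, 4, 4, 10],
    "3PointerPerGame": [5, 0, 1, 1, 1, 1, 2, 2, 2, 0],
})

# For each column, collect the names of the players flagged as outliers.
flagged = {
    col: toy.loc[iqr_outlier_mask(toy[col]), "Player"].tolist()
    for col in ["2PointerPerGame", "3PointerPerGame"]
}
print(flagged)
```

In the notebook, the same helper could be applied to `df` to pull out the league leaders that the box plots mark as outliers.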

Swarm Plot Analysis 1: Based on NBA Player's Position

In [43]:
plt.figure(figsize=(10, 6))
sns.swarmplot(x=df['3PointerPerGame'], y=df['TotalPointsPerGame'], palette='Spectral', hue=df['Position'])
plt.title("TotalPointsPerGame vs 3PointerPerGame", size=15)
plt.legend(bbox_to_anchor=(1.0, 1), loc=2, borderaxespad=1)
Out[43]:
<Figure size 720x432 with 0 Axes>
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1266a26d0>
Out[43]:
Text(0.5, 1.0, 'TotalPointsPerGame vs 3PointerPerGame')
Out[43]:
<matplotlib.legend.Legend at 0x126580c70>

NBA Player Position
1. PF - Power Forwards
2. SG - Shooting Guards
3. SF - Small Forwards
4. C - Center
5. PG - Point Guards

Explanation of Swarm Plot

A swarm plot is easier to read than a scatter plot here because the points do not overlap, and it groups the data by category like a bar plot. We are comparing total points per game with three pointers per game, and the swarm plot lets us see which NBA player positions have scored more three pointers per game in our historical data. The red dots correspond to the Shooting Guard (SG) position. Players in this position have a high scoring range of around 2 to 4 three pointers per game, which leads to an average of 10 to 25 total points per game, compared to other positions that have a low scoring range of 0 to 1 three pointers per game.

The light orange dots correspond to the Point Guard (PG) position. Players in this position have a high scoring range of around 3 to 5 three pointers per game, which leads to an average of 10 to 35 total points per game, compared to other positions.

It is interesting to note that players who play multiple positions (e.g., PF-C, SF-SG, PG-SG, PF-SF) do not score a significant number of total points per game. This suggests that players who are assigned multiple positions may have other, unique responsibilities compared to the traditional NBA positions mentioned above.
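The visual comparison between positions can be backed by a small numeric summary. The sketch below uses a made-up table with the same column names as `df` and groups by position to compare the average three pointers and total points per game; the output column names (`players`, `mean_threes`, `mean_points`) are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; the notebook would use df instead of toy.
toy = pd.DataFrame({
    "Position": ["SG", "SG", "PG", "PG", "C", "C", "PF-C"],
    "3PointerPerGame": [3.0, 2.0, 4.0, 2.0, 0.0, 1.0, 0.0],
    "TotalPointsPerGame": [18.0, 12.0, 25.0, 15.0, 9.0, 11.0, 3.0],
})

# One row per position: number of players and average per-game scoring.
by_position = toy.groupby("Position").agg(
    players=("3PointerPerGame", "count"),
    mean_threes=("3PointerPerGame", "mean"),
    mean_points=("TotalPointsPerGame", "mean"),
).sort_values("mean_threes", ascending=False)

print(by_position)
```

The `players` count matters when reading the averages, since a multiple-position group with only a few rows can look very different from the traditional positions for that reason alone.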

Swarm Plot Analysis 2: Based on NBA Player's Position

In [44]:
plt.figure(figsize=(10, 6))
sns.swarmplot(x=df['2PointerPerGame'], y=df['TotalPointsPerGame'], palette='terrain', hue=df['Position'])
plt.title("TotalPointsPerGame vs 2PointerPerGame", size=15)
plt.legend(bbox_to_anchor=(1.0, 1), loc=2, borderaxespad=1)
Out[44]:
<Figure size 720x432 with 0 Axes>
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1264eb9a0>
Out[44]:
Text(0.5, 1.0, 'TotalPointsPerGame vs 2PointerPerGame')
Out[44]:
<matplotlib.legend.Legend at 0x126563b50>

NBA Player Position
1. PF - Power Forwards
2. SG - Shooting Guards
3. SF - Small Forwards
4. C - Center
5. PG - Point Guards

Explanation of Swarm Plot

Based on the swarm plot, the green and light green dots correspond to the Center (C) and Point Guard (PG) positions. Players in these positions have a high scoring range of around 2 to 7 two pointers per game, which leads to an average of 5 to 30 total points per game, compared to other positions.

The blue dots correspond to the Power Forward (PF) position. Players in this position have a high scoring range of around 1 to 6 two pointers per game, which leads to an average of 5 to 20 total points per game, compared to other positions. Also, it is interesting to note that players who play multiple positions (e.g., PF-C, SF-SG, PG-SG, PF-SF) do not score a significant number of total points per game, possibly because they play multiple positions only occasionally during a season rather than regularly.

Pie Chart Analysis: Total 3 Pointers for Overall Seasons from 2015-2020 Categorized by NBA Player Positions

In [45]:
import plotly.graph_objs as go
fig = px.pie(df, values='3Pointers', names='Position',
             title='NBA Players Total 3 Pointers Statistics Season 2015-2020',
             # labels must be a dict mapping a column name to its display label
             hover_data=['Position'], labels={'3Pointers': 'Total 3 Pointers'})
fig.update_traces(textposition='inside', textinfo='percent+label')

Hover over Data on NBA Player Position
1. PF - Power Forwards
2. SG - Shooting Guards
3. SF - Small Forwards
4. C - Center
5. PG - Point Guards

Explanation of Pie Chart: NBA Player 3 Pointers Statistics for Seasons 2015-2020 Categorized by Positions

To check whether my hunch is correct, namely that players in the Point Guard (PG) and Shooting Guard (SG) positions are likely to score more three pointers on average than other positions, we look at the broader data with the help of a pie chart. The chart represents the overall three pointers made by NBA players over the seasons from 2015 to 2020, sliced into percentages by position.

Shooting Guard (SG) and Point Guard (PG) Analysis:
31.6% of the 3 pointers are scored by players in the Shooting Guard (SG) position, a total of 43,471 three pointers in our historical data. Similarly, 22.5% of 3 pointers are scored by players in the Point Guard (PG) position, a total of 30,924 three pointers over the five regular seasons. This shows that Shooting Guards scored more three pointers over this period than any other position in the NBA.

Small Forwards (SF) and Power Forwards (PF) Analysis:
On the other hand, players in the Small Forward (SF) position scored 27,797 total three pointers, which covers 20.2% of the data. The pie chart also indicates that 18.3% of 3 pointers are scored by players in the Power Forward (PF) position. It is surprising that there is not much difference between the Small Forward, Power Forward, and Point Guard positions in the three pointer category. This suggests that a player's scoring style does not depend strictly on position. Players may be working to improve in both shooting categories over the years, which would explain the close percentages.

Center (C) and Multiple Position Analysis:
Center (C) and the multiple-position categories have a combined share of only 7.4% of three pointers in the five-year historical data, which makes them the outliers in this category. Based on our assumptions, these players are more likely to attempt two pointers than three pointers. Also, players listed with multiple positions do not make a significant number of three pointers, possibly because they play multiple positions only occasionally during a season rather than regularly.
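The percentages in the pie chart come from summing the three pointers for each position and dividing by the overall total. A minimal sketch of that arithmetic, using made-up totals rather than the real data:

```python
import pandas as pd

# Made-up season rows for illustration; the notebook would use df.
toy = pd.DataFrame({
    "Position": ["SG", "SG", "PG", "SF", "PF", "C"],
    "3Pointers": [120, 80, 150, 100, 40, 10],
})

# Total three pointers per position, then each position's share of the whole.
totals = toy.groupby("Position")["3Pointers"].sum()
share = (totals / totals.sum() * 100).round(1).sort_values(ascending=False)

print(share)
```

Because the shares are computed from totals, a position with more players or more minutes contributes more to its slice, so the chart describes volume rather than per-player accuracy.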

Pie Chart Analysis Total 2 Pointer for Overall Season from 2015-2020 Categorized By NBA Player Positions

In [46]:
import plotly.express as px
fig = px.pie(df, values='2Pointers', names='Position',
             title='NBA Players Total 2 Pointers Statistics Season 2015-2020',
             # labels must be a dict mapping a column name to its display label
             hover_data=['Position'], labels={'2Pointers': 'Total 2 Pointers'})
fig.update_traces(textposition='inside', textinfo='percent+label')

Hover over Data on NBA Player Position
1. PF - Power Forwards
2. SG - Shooting Guards
3. SF - Small Forwards
4. C - Center
5. PG - Point Guards

Explanation of Pie Chart: NBA Player 2 Pointers Statistics for Seasons 2015-2020 Categorized by Positions

To check whether my hunch is correct, namely that players in the Power Forward (PF), Small Forward (SF), and Center (C) positions are likely to score more two pointers on average than other positions, we look at the broader data with the help of a pie chart. The chart represents the overall two pointers made by NBA players over the seasons from 2015 to 2020, sliced into percentages by position.

Center (C) and Point Guard (PG) Analysis:
24.2% of the 2 pointers are scored by players in the Center (C) position, a total of 92,408 two pointers in our historical data. Similarly, 20.3% of 2 pointers are scored by players in the Point Guard (PG) position, a total of 77,449 two pointers over the five regular seasons. This shows that Centers scored more two pointers over this period than any other position in the NBA.

Power Forwards (PF) and Shooting Guards (SG) Analysis:
On the other hand, players in the Power Forward (PF) position scored 74,804 total two pointers, which covers 19.6% of the data. The pie chart also indicates that 19.6% of 2 pointers are scored by players in the Shooting Guard (SG) position. It is surprising to see a tie between the Power Forward and Shooting Guard positions in the two pointer category. This suggests that a player's scoring style does not depend strictly on position. Players may be working to improve in both shooting categories over the years, which would explain the tie in percentages between these two positions.

Small Forwards (SF) and Multiple Position Analysis:
Small Forwards (SF) and the multiple-position categories have a combined share of only 16.3% of two pointers in the five-year historical data, which makes them the outliers in this category. Our hunch that they are more likely to score two pointers than three pointers was incorrect, which suggests a neutral correlation between a player's position and the offensive shooting category. The multiple positions contributed only 0.7% of the data, which indicates that these players do not make a significant number of two pointers, possibly because they play multiple positions only occasionally during a season rather than regularly.

Polar Line Plot Analysis :

Total 3 Pointer Per Game for Overall Season from 2015-2020 Categorized By Players Made Playoffs or Not

In [47]:
import plotly.express as px
fig = px.line_polar(df, r='3PointerPerGame', theta='Year',color='Player', hover_name='Playoff', line_close=True, width=800, height=500)
fig.show()

Hover over Year Angle for Leading Scorers in 3 Pointer Per Game Category
1. 2015-2016 shows green line
2. 2016-2017 shows green line
3. 2017-2018 shows green and pink line
4. 2018-2019 shows green and pink line
5. 2019-2020 represents orange line

Polar Line Plot Analysis:
The polar line plot helps us understand which NBA player is leading the three pointer per game category in each season from 2015 to 2020. Each player's line shows the number of three pointers per game they scored, and the line that reaches a particular year angle at the largest radius belongs to that season's leader. Hovering over the line also shows whether the leading player made the playoffs or not.

Let's start with the 2015-2016 and 2016-2017 regular seasons. Hovering over the green line at these year angles shows that Steph Curry leads the three pointer category in both years and made the playoffs.

Similarly, there is a tie between James Harden (pink line) and Steph Curry (green line): both reach the 2017-2018 and 2018-2019 year angles with leading scores of 4 and 5 three pointers per game, respectively.

Finally, Damian Lillard (orange line) is a leading scorer with 4 three pointers per game in 2019-2020. The data reveals that all 3 NBA players who lead the three pointer category made the playoffs. To further test this pattern, we will look at the 2 pointer per game category as well and see whether it persists.
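Instead of hovering over each year angle, the season leaders (including ties) can also be listed directly. This is a sketch on made-up players that assumes the same column names as `df`; it keeps every row whose value equals the maximum for its season.

```python
import pandas as pd

# Made-up rows for illustration; the notebook would use df.
toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player A", "Player B", "Player C"],
    "Year": ["2017-2018"] * 3 + ["2018-2019"] * 3,
    "3PointerPerGame": [4.0, 4.0, 2.0, 5.0, 3.0, 5.0],
    "Playoff": ["Y", "Y", "N", "Y", "N", "Y"],
})

# Broadcast each season's maximum back onto its rows, then keep the rows that match it.
season_max = toy.groupby("Year")["3PointerPerGame"].transform("max")
leaders = toy[toy["3PointerPerGame"] == season_max].sort_values(["Year", "Player"])

print(leaders[["Year", "Player", "3PointerPerGame", "Playoff"]])
```

Comparing against the per-season maximum, rather than taking a single top row per season, is what keeps tied leaders in the result.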

Total 2 Pointer Per Game for Overall Season from 2015-2020 Categorized By Players Made Playoffs or Not

In [48]:
import plotly.express as px
fig = px.line_polar(df, r='2PointerPerGame', theta='Year',color='Player', hover_name='Playoff', line_close=True, width=800, height=500)
fig.show()

Hover over Year Angle for Leading Scorers in 2 Pointer Per Game Category
1. 2015-2016 shows red and orange line
2. 2016-2017 shows red line
3. 2017-2018 shows red line
4. 2018-2019 shows red and blue line
5. 2019-2020 represents pink line

Polar Line Plot Analysis:
The polar line plot helps us understand which NBA player is leading the two pointer per game category in each season from 2015 to 2020. Each player's line shows the number of two pointers per game they scored, and the line that reaches a particular year angle at the largest radius belongs to that season's leader. Hovering over the line also shows whether the leading player made the playoffs or not.

Let's start with the 2015-2016, 2016-2017, 2017-2018, and 2018-2019 regular seasons. Hovering over the red line at these year angles shows that Anthony Davis leads the two pointer category in these four years but made the playoffs only once in those 4 years.

In addition, there is a tie between LaMarcus Aldridge (orange line) and Anthony Davis (red line): both reach the 2015-2016 year angle with a leading score of 9 two pointers per game, but only LaMarcus Aldridge made the playoffs that year, not Anthony Davis.

Similarly, there is a tie between Giannis Antetokounmpo (blue line) and Anthony Davis (red line): both reach the 2018-2019 year angle with a leading score of 9 two pointers per game, but only Giannis Antetokounmpo made the playoffs that year, not Anthony Davis.

Finally, Russell Westbrook (pink line) is the leading scorer with 10 two pointers per game in 2019-2020. The data reveals that only 2 out of 3 NBA players who are the leading scorers in the two pointer category made the playoffs. So our assumption was wrong: the pattern from the previous polar plot did not persist, and NBA players leading a shooting category do not always make the playoffs.

Scatter Plot Analysis: GamesPlayed vs MinutesPerGame from 2015-2020

In [49]:
import plotly.express as px
fig = px.scatter(
    df, x='MinutesPerGame', y='GamesPlayed', size='TotalPoints', size_max=13, color_continuous_scale='rdylbu_r',
    color='Playoff', hover_name='Player', trendline="ols", title='GamesPlayed vs MinutesPerGame')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

Hover Over Data For Additional Information
1. Player Name
2. Total Points Scored from 2015-2020
3. Games Played
4. Minutes Per Game
5. N - Did not make Playoffs (Blue Dots)
6. Y - Made it to the Playoffs (Red Dots)

Scatter Plot Analysis:
The main goal of this scatter plot is to analyze whether the data indicates any correlation between GamesPlayed and MinutesPerGame, and how that affects a player's chances of making the playoffs.

Looking at the trendline in the scatter plot, we see a lot of red dots in the upper right quadrant of the plot. NBA players who have played around 75 to 80 games and average more than 30 minutes per game have a better chance of making the playoffs and of scoring more points, based on the historical data.

On the other hand, the trendline for the blue dots runs from the lower left quadrant to the middle of the upper right quadrant of the plot. NBA players who have played around 20 to 65 games and average less than 30 minutes per game have a lower chance of making the playoffs and of scoring points, based on the historical data.

But there are some exceptions in the scatter plot. For instance, there are blue dots in the upper right quadrant where some NBA players have either played all 82 games or more than 75 games and average more than 30 minutes per game, but still did not make the playoffs. So based on our analysis we conclude that GamesPlayed and MinutesPerGame have a neutral relationship with making the playoffs, and that a player's chances also depend on overall team statistics.
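The reading of the scatter plot can be cross-checked with two numbers: the Pearson correlation between the two columns, and the share of players who made the playoffs inside and outside the heavy-workload corner of the plot. The sketch below uses made-up rows, and the thresholds (75 games, 30 minutes) are taken from the description above; the `Workload` and `MadePlayoffs` columns are hypothetical helpers.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the notebook would use df.
toy = pd.DataFrame({
    "GamesPlayed": [20, 35, 50, 65, 75, 80, 82, 78],
    "MinutesPerGame": [8.0, 12.0, 18.0, 24.0, 31.0, 34.0, 33.0, 32.0],
    "Playoff": ["N", "N", "N", "Y", "Y", "Y", "N", "Y"],
})

# Pearson correlation (the pandas default) between games and minutes.
r = toy["GamesPlayed"].corr(toy["MinutesPerGame"])

# Label the upper right corner of the plot and compare playoff rates.
heavy = (toy["GamesPlayed"] >= 75) & (toy["MinutesPerGame"] > 30)
toy["Workload"] = np.where(heavy, "75+ games, 30+ min", "other")
toy["MadePlayoffs"] = toy["Playoff"].eq("Y")
playoff_rate = toy.groupby("Workload")["MadePlayoffs"].mean()

print(round(r, 2))
print(playoff_rate)
```

A rate below 1.0 in the heavy-workload group corresponds to the blue dots in the upper right quadrant: players with many games and minutes whose teams still missed the playoffs.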

Heat Map Analysis: Individual NBA Player Statistics Based on Average 3 Pointers Per Game

Since the 3PointerPerGame shooting category was highly correlated with overall player statistics, we will use the 3PointerPerGame score to find the top 50 player seasons with a high average in the three pointer shooting category. The following heat map analysis will help reveal whether our hunch is correct or incorrect that NBA players who score a high average of 3-pointers per game are more likely to be on teams that make the NBA playoffs.

Let's look at the top 50 individual NBA player statistics, grouped by player, team, position, playoff status, and year, and sorted on average 3 pointers per game.

In [50]:
# look at heat map analysis on overall mean 3 Pointer Per Game for five-year historical data

nba_3Pointer =df.groupby(['Player','Team','Position','Playoff','Year']).agg({'3PointerPerGame':['mean']})
nba_3Pointer.columns = ['3PointerPerGame_Mean']
players_avg_3PointerPerGame = nba_3Pointer.sort_values(by=['3PointerPerGame_Mean'], ascending = False)[:50]
players_avg_3PointerPerGame.style.background_gradient(cmap = 'Blues')
Out[50]:
Player              Team  Position  Playoff  Year       3PointerPerGame_Mean
Stephen Curry       GSW   PG        Y        2018-2019  5.000000
Stephen Curry       GSW   PG        Y        2015-2016  5.000000
James Harden        HOU   PG        Y        2018-2019  5.000000
Paul George         OKC   SF        Y        2018-2019  4.000000
Malik Beasley       MIN   SG        N        2019-2020  4.000000
James Harden        HOU   SG        Y        2017-2018  4.000000
James Harden        HOU   SG        Y        2019-2020  4.000000
Stephen Curry       GSW   PG        Y        2017-2018  4.000000
Stephen Curry       GSW   PG        Y        2016-2017  4.000000
Dāvis Bertāns       WAS   PF        N        2019-2020  4.000000
D'Angelo Russell    TOT   PG        N        2019-2020  4.000000
D'Angelo Russell    GSW   PG        N        2019-2020  4.000000
Damian Lillard      POR   PG        Y        2019-2020  4.000000
Buddy Hield         SAC   SG        N        2019-2020  4.000000
R.J. Hunter         BOS   SG        Y        2018-2019  4.000000
Duncan Robinson     MIA   SG        Y        2019-2020  4.000000
Trevor Ariza        HOU   SF        Y        2017-2018  3.000000
Tim Hardaway Jr.    DAL   SG        Y        2019-2020  3.000000
James Harden        HOU   SG        Y        2015-2016  3.000000
Eric Gordon         HOU   SG        Y        2017-2018  3.000000
Eric Gordon         HOU   SG        Y        2018-2019  3.000000
Allen Crabbe        BRK   SG        N        2017-2018  3.000000
Fred VanVleet       TOR   SG        Y        2019-2020  3.000000
Eric Gordon         NOP   SG        N        2015-2016  3.000000
Danilo Gallinari    OKC   PF        Y        2019-2020  3.000000
Marcus Morris       NYK   SF        N        2019-2020  3.000000
Landry Shamet       LAC   SG        Y        2018-2019  3.000000
Jaren Jackson Jr.   MEM   C         N        2019-2020  3.000000
Kevin Durant        GSW   SF        Y        2017-2018  3.000000
Robert Covington    PHI   SF        Y        2017-2018  3.000000
Tim Hardaway Jr.    NYK   SG        N        2018-2019  3.000000
Bradley Beal        WAS   SG        Y        2016-2017  3.000000
Nick Young          LAL   SG        N        2016-2017  3.000000
Kevin Durant        OKC   SF        Y        2015-2016  3.000000
Eric Gordon         HOU   SG        Y        2016-2017  3.000000
Bogdan Bogdanović   SAC   SG        N        2019-2020  3.000000
Kevin Love          CLE   PF        N        2019-2020  3.000000
Bojan Bogdanović    UTA   SF        Y        2019-2020  3.000000
Bradley Beal        WAS   SG        N        2019-2020  3.000000
Robert Covington    PHI   SF        N        2015-2016  3.000000
Devonte' Graham     CHO   PG        N        2019-2020  3.000000
Justin Holiday      CHI   SG        N        2018-2019  3.000000
Damian Lillard      POR   PG        Y        2018-2019  3.000000
Kemba Walker        CHO   PG        N        2017-2018  3.000000
James Ennis         NOP   SF        N        2015-2016  3.000000
Wayne Ellington     TOT   SG        N        2018-2019  3.000000
Karl-Anthony Towns  MIN   C         N        2019-2020  3.000000
Luka Dončić         DAL   PG        Y        2019-2020  3.000000
Otto Porter         CHI   SF        N        2018-2019  3.000000
Trae Young          ATL   PG        N        2019-2020  3.000000

Explanation of Heat Map Analysis: Top 50 NBA Player Statistics based on Average 3 Pointers Per Game

Among the top 50 entries, the highest average is 5 three-pointers per game, recorded by Stephen Curry (2015-2016 and 2018-2019) and James Harden (2018-2019). Both were on teams that made the playoffs in those seasons, and both appear repeatedly in the table, which indicates consistent performance over the years.

Looking at the broader picture, some players appear more than once in the table, either because they played for different teams or because they posted a high three-point average at more than one position or in more than one season. The main point, however, is that 23 of the 50 entries (46%) belong to players whose teams did not make the playoffs. This indicates that our hunch is incorrect: averaging a high number of 3-pointers per game does not by itself make a player more likely to be on a team that makes the NBA playoffs. To test this further, we will narrow the data down to the top 10 entries by average 3-pointers per game.
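The non-playoff share quoted above can be recomputed from the table instead of being counted by hand. The following is a minimal sketch on a small hypothetical table (the players `A` to `E` and their values are made up for illustration, not taken from the project data). It assumes, as in the cell above, that `Playoff` ends up as an index level after the `groupby`.

```python
import pandas as pd

# Hypothetical stand-in for the project DataFrame; values are illustrative only.
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D", "E"],
        "Playoff": ["Y", "N", "Y", "N", "N"],
        "3PointerPerGame": [5, 4, 4, 3, 3],
    }
)

# Same pattern as the notebook cell: group, average, flatten the column label.
top = toy.groupby(["Player", "Playoff"]).agg({"3PointerPerGame": ["mean"]})
top.columns = ["3PointerPerGame_Mean"]

# Playoff is an index level after the groupby, so read it from the index.
playoff_flags = top.index.get_level_values("Playoff")
non_playoff_count = int((playoff_flags == "N").sum())
non_playoff_share = non_playoff_count / len(top)

print(f"{non_playoff_count} of {len(top)} entries ({non_playoff_share:.0%}) missed the playoffs")
```

Applied to `players_avg_3PointerPerGame`, the same three lines after the `groupby` would give the count and percentage for the real top 50 table.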

Top 10 NBA Player Statistics based on Average 3 Pointer Per Game

In [51]:
players_avg_3PointerPerGame.sort_values(by=['3PointerPerGame_Mean'], ascending = False)[:10]
Out[51]:
3PointerPerGame_Mean
Player Team Position Playoff Year
Stephen Curry GSW PG Y 2018-2019 5.0
James Harden HOU PG Y 2018-2019 5.0
Stephen Curry GSW PG Y 2015-2016 5.0
Dāvis Bertāns WAS PF N 2019-2020 4.0
Duncan Robinson MIA SG Y 2019-2020 4.0
R.J. Hunter BOS SG Y 2018-2019 4.0
Buddy Hield SAC SG N 2019-2020 4.0
Damian Lillard POR PG Y 2019-2020 4.0
D'Angelo Russell TOT PG N 2019-2020 4.0
GSW PG N 2019-2020 4.0

Explanation of Heat Map Analysis: Top 10 NBA Player Statistics based on Average 3 Pointers Per Game

Now that we have narrowed the data down to the top 10 entries by average 3-pointers per game, we can see that some players have performed consistently well in this shooting category. Stephen Curry leads with 5 three-pointers per game in both the 2015-2016 and 2018-2019 seasons, and his team made the playoffs both times. Similarly, James Harden tied that average of 5 three-pointers per game in 2018-2019, also on a playoff team.

On the other hand, D'Angelo Russell averaged 4 three-pointers per game in the 2019-2020 season and appears in two rows (TOT, which is the combined line for a player who changed teams during the season, and GSW), neither of which is marked as a playoff entry. In addition, 4 of the 10 entries (40%) belong to players whose teams did not make the playoffs. This further supports the conclusion that individual three-point statistics alone do not determine whether a player ends up on a team that makes the playoffs.
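A share such as 40% is easier to interpret next to a baseline: the playoff share among all rows, not just the top ones. The sketch below shows one way to make that comparison. It is an illustration on hypothetical data (players `A` to `H` are invented), and it assumes `Playoff` is still an ordinary column, as it is in `df` before grouping.

```python
import pandas as pd

# Hypothetical data; in the notebook this would be the full df.
toy_df = pd.DataFrame(
    {
        "Player": list("ABCDEFGH"),
        "Playoff": ["Y", "Y", "N", "Y", "N", "N", "Y", "N"],
        "3PointerPerGame": [5, 4, 4, 3, 2, 2, 1, 1],
    }
)

# Baseline: share of all rows that belong to playoff teams.
baseline_share = (toy_df["Playoff"] == "Y").mean()

# Share of playoff rows among the four highest three-point averages.
top4 = toy_df.nlargest(4, "3PointerPerGame")
top_share = (top4["Playoff"] == "Y").mean()

print(f"baseline: {baseline_share:.0%}, top 4: {top_share:.0%}")
```

If the top-N share is close to the baseline, the stat adds little information about playoff status; a clearly higher share would point the other way.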

Heat Map Analysis: Individual NBA Player Statistics based on Average 2 Pointers Per Game

Next, we analyze the 2PointerPerGame shooting category to see whether the data reveals patterns similar to those in the three-pointer shooting category. We will use 2PointerPerGame to find the top 50 entries with the highest average in the two-pointer shooting category. The following heat map analysis will help show whether our hunch is correct or incorrect here, namely that NBA players who average a high number of 2-pointers per game are more likely to be on teams that make the NBA playoffs.

Let's look at the top 50 individual NBA player statistics, grouped by player and by four categories (team, position, playoff status, and year), and sorted by average 2-pointers per game.

In [52]:
# look at heat map analysis on overall mean 2 Pointer Per Game for five-year historical data
nba_2Pointer =df.groupby(['Player','Team','Position','Playoff','Year']).agg({'2PointerPerGame':['mean']})
nba_2Pointer.columns = ['2PointerPerGame_Mean']

players_avg_2PointerPerGame = nba_2Pointer.sort_values(by=['2PointerPerGame_Mean'], ascending = False)[:50]
players_avg_2PointerPerGame.style.background_gradient(cmap = 'RdYlGn')
Out[52]:
2PointerPerGame_Mean
Player Team Position Playoff Year
Anthony Davis NOP PF Y 2017-2018 10.000000
Russell Westbrook HOU PG Y 2019-2020 10.000000
Anthony Davis NOP C N 2016-2017 10.000000
LeBron James CLE SF Y 2015-2016 9.000000
Giannis Antetokounmpo MIL PF Y 2019-2020 9.000000
LeBron James CLE PF Y 2017-2018 9.000000
DeMar DeRozan TOR SG Y 2016-2017 9.000000
Anthony Davis NOP C N 2015-2016 9.000000
2018-2019 9.000000
Karl-Anthony Towns MIN C N 2016-2017 9.000000
Giannis Antetokounmpo MIL PF Y 2017-2018 9.000000
2018-2019 9.000000
LaMarcus Aldridge SAS C Y 2017-2018 9.000000
Nikola Vučević ORL C N 2015-2016 8.000000
Russell Westbrook OKC PG Y 2017-2018 8.000000
DeMar DeRozan SAS SF N 2019-2020 8.000000
Joel Embiid PHI C Y 2018-2019 8.000000
DeMarcus Cousins SAC C N 2015-2016 8.000000
LeBron James LAL SF N 2018-2019 8.000000
Zion Williamson NOP PF N 2019-2020 8.000000
Anthony Davis LAL PF Y 2019-2020 8.000000
DeMar DeRozan SAS SG Y 2018-2019 8.000000
Nikola Vučević ORL C Y 2018-2019 8.000000
Giannis Antetokounmpo MIL SF Y 2016-2017 8.000000
Blake Griffin LAC PF Y 2015-2016 8.000000
Russell Westbrook OKC PG Y 2016-2017 8.000000
LeBron James CLE SF Y 2016-2017 8.000000
Deandre Ayton PHO C N 2019-2020 8.000000
LaMarcus Aldridge SAS C Y 2018-2019 8.000000
Brook Lopez BRK C N 2015-2016 8.000000
Jonas Valančiūnas MEM C N 2018-2019 8.000000
T.J. Warren PHO SF N 2017-2018 8.000000
Andre Drummond DET C Y 2015-2016 7.000000
Kevin Durant GSW SF Y 2018-2019 7.000000
Harrison Barnes DAL PF N 2016-2017 7.000000
Andre Drummond TOT C N 2019-2020 7.000000
DET C Y 2018-2019 7.000000
Kevin Durant GSW SF Y 2017-2018 7.000000
Andre Drummond DET C N 2019-2020 7.000000
CLE C N 2019-2020 7.000000
Deandre Ayton PHO C N 2018-2019 7.000000
Blake Griffin LAC PF Y 2016-2017 7.000000
Kyrie Irving BRK PG Y 2019-2020 7.000000
Domantas Sabonis IND PF Y 2019-2020 7.000000
John Wall WAS PG Y 2016-2017 7.000000
Kevin Durant GSW PF Y 2016-2017 7.000000
Jabari Parker MIL PF Y 2016-2017 7.000000
Kyrie Irving CLE PG Y 2016-2017 7.000000
Karl-Anthony Towns MIN C N 2015-2016 7.000000
LaMarcus Aldridge SAS PF Y 2015-2016 7.000000

Explanation of Heat Map Analysis: Top 50 NBA Player Statistics based on Average 2 Pointers Per Game

Among the top 50 entries, the highest average is 10 two-pointers per game, recorded by Anthony Davis (2016-2017 and 2017-2018) and Russell Westbrook (2019-2020). Davis in 2017-2018 and Westbrook in 2019-2020 were on teams that made the playoffs, and both players appear repeatedly in the table, which indicates consistent performance over the years. At the same time, Davis posted the same average for the same team in 2016-2017 without making the playoffs. This shows that strong performance in the two-pointer shooting category does not guarantee a playoff spot.

Looking at the broader picture, players such as Anthony Davis, LeBron James, and Russell Westbrook appear more than once in the table, either because they played for different teams or because they posted a high two-point average at more than one position or in more than one season. The main point, however, is that 19 of the 50 entries (38%) belong to players whose teams did not make the playoffs. This indicates that our hunch is incorrect: averaging a high number of 2-pointers per game does not by itself make a player more likely to be on a team that makes the NBA playoffs. To test this further, we will narrow the data down to the top 10 entries by average 2-pointers per game.

Top 10 NBA Player Statistics based on Average 2 Pointer Per Game

In [53]:
players_avg_2PointerPerGame.sort_values(by=['2PointerPerGame_Mean'], ascending = False)[:10]
Out[53]:
2PointerPerGame_Mean
Player Team Position Playoff Year
Anthony Davis NOP PF Y 2017-2018 10.0
C N 2016-2017 10.0
Russell Westbrook HOU PG Y 2019-2020 10.0
Anthony Davis NOP C N 2018-2019 9.0
LaMarcus Aldridge SAS C Y 2017-2018 9.0
Giannis Antetokounmpo MIL PF Y 2017-2018 9.0
Karl-Anthony Towns MIN C N 2016-2017 9.0
Giannis Antetokounmpo MIL PF Y 2018-2019 9.0
Anthony Davis NOP C N 2015-2016 9.0
DeMar DeRozan TOR SG Y 2016-2017 9.0

Explanation of Heat Map Analysis: Top 10 NBA Player Statistics based on Average 2 Pointers Per Game

Now that we have narrowed the data down to the top 10 entries by average 2-pointers per game, we can see that some players have performed consistently well in this shooting category. Anthony Davis leads with 10 two-pointers per game in both 2016-2017 and 2017-2018, although his team made the playoffs only in 2017-2018. Davis also averaged 9 two-pointers per game in 2015-2016 and 2018-2019 without making the playoffs with the same team.

On the other hand, Karl-Anthony Towns and DeMar DeRozan both averaged 9 two-pointers per game in the same 2016-2017 season on different teams, but only DeRozan's team made the playoffs. The data indicates that overall team performance plays a more vital role in making the playoffs than individual player statistics alone.

In addition, 4 of the 10 entries (40%) in the leading two-pointer shooting category belong to players whose teams did not make the playoffs. This further supports the conclusion that individual two-point statistics alone do not determine whether a player ends up on a team that makes the playoffs.
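The same group, average, sort, and slice steps are repeated for every stat in this notebook. As a sketch, they could be wrapped in one helper. The function name `top_n_by_stat` and the constant `GROUP_KEYS` are hypothetical names introduced here, and the toy rows are made up; only the column names come from the notebook.

```python
import pandas as pd

# Grouping keys used throughout the notebook cells.
GROUP_KEYS = ["Player", "Team", "Position", "Playoff", "Year"]


def top_n_by_stat(frame, stat, n=50):
    """Return the n highest group means of `stat`, sorted in descending order."""
    means = frame.groupby(GROUP_KEYS)[stat].mean().to_frame(f"{stat}_Mean")
    # mergesort is a stable sort, so repeated runs order tied rows the same way.
    ordered = means.sort_values(by=f"{stat}_Mean", ascending=False, kind="mergesort")
    return ordered.head(n)


# Hypothetical rows for illustration only.
toy_df = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "Team": ["X", "Y", "Z"],
        "Position": ["PG", "C", "SF"],
        "Playoff": ["Y", "N", "Y"],
        "Year": ["2019-2020"] * 3,
        "2PointerPerGame": [7, 10, 9],
    }
)

top2 = top_n_by_stat(toy_df, "2PointerPerGame", n=2)
print(top2)
```

With such a helper, each section would reduce to one call per stat followed by the `style.background_gradient` line.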

Heat Map Analysis: Individual NBA Player Statistics based on Average Field Goals Per Game

Next, we analyze the FieldGoalsPerGame shooting category to see whether the data reveals patterns similar to those in the two-pointer shooting category. We will use FieldGoalsPerGame to find the top 50 entries with the highest average in the field goals shooting category. This column includes both two-pointer and three-pointer shots, which matters because, depending on position, some NBA players shoot mostly 2-pointers and others mostly 3-pointers. The following heat map analysis will help show whether our hunch is correct or incorrect here, namely that NBA players who average a high number of field goals per game are more likely to be on teams that make the NBA playoffs.

Let's look at the top 50 individual NBA player statistics, grouped by player and by four categories (team, position, playoff status, and year), and sorted by average field goals per game.

In [54]:
# look at heat map analysis on overall mean Field Goals Per Game for five-year historical data
nba_fieldGoals =df.groupby(['Player','Team','Position','Playoff','Year']).agg({'FieldGoalsPerGame':['mean']})
nba_fieldGoals.columns = ['FieldGoalsPerGame_Mean']

players_avg_fieldGoalsPerGame = nba_fieldGoals.sort_values(by=['FieldGoalsPerGame_Mean'], ascending = False)[:50]
players_avg_fieldGoalsPerGame.style.background_gradient(cmap = 'coolwarm')
Out[54]:
FieldGoalsPerGame_Mean
Player Team Position Playoff Year
James Harden HOU PG Y 2018-2019 11.000000
Russell Westbrook HOU PG Y 2019-2020 11.000000
Giannis Antetokounmpo MIL PF Y 2019-2020 11.000000
Kyrie Irving BRK PG Y 2019-2020 10.000000
Bradley Beal WAS SG N 2019-2020 10.000000
Russell Westbrook OKC PG Y 2016-2017 10.000000
James Harden HOU SG Y 2019-2020 10.000000
Stephen Curry GSW PG Y 2015-2016 10.000000
Luka Dončić DAL PG Y 2019-2020 10.000000
Karl-Anthony Towns MIN C N 2016-2017 10.000000
LeBron James LAL SF N 2018-2019 10.000000
PG Y 2019-2020 10.000000
CLE SF Y 2016-2017 10.000000
PF Y 2017-2018 10.000000
Giannis Antetokounmpo MIL PF Y 2017-2018 10.000000
2018-2019 10.000000
LeBron James CLE SF Y 2015-2016 10.000000
DeMar DeRozan TOR SG Y 2016-2017 10.000000
Anthony Davis NOP PF Y 2017-2018 10.000000
Kevin Durant OKC SF Y 2015-2016 10.000000
Anthony Davis NOP C N 2016-2017 10.000000
Kyrie Irving BOS PG Y 2018-2019 9.000000
Russell Westbrook OKC PG Y 2017-2018 9.000000
Isaiah Thomas BOS PG Y 2016-2017 9.000000
Victor Oladipo IND SG Y 2017-2018 9.000000
Russell Westbrook OKC PG Y 2018-2019 9.000000
CJ McCollum POR SG Y 2019-2020 9.000000
Kevin Durant GSW SF Y 2017-2018 9.000000
James Harden HOU SG Y 2015-2016 9.000000
Bradley Beal WAS SG N 2018-2019 9.000000
Kevin Durant GSW SF Y 2018-2019 9.000000
Zach LaVine CHI SG N 2019-2020 9.000000
Trae Young ATL PG N 2019-2020 9.000000
CJ McCollum POR SG Y 2016-2017 9.000000
Damian Lillard POR PG Y 2016-2017 9.000000
2017-2018 9.000000
James Harden HOU SG Y 2017-2018 9.000000
Paul George OKC SF Y 2018-2019 9.000000
Kyrie Irving CLE PG Y 2016-2017 9.000000
Kawhi Leonard LAC SF Y 2019-2020 9.000000
LaMarcus Aldridge SAS C Y 2017-2018 9.000000
Andrew Wiggins MIN SF N 2016-2017 9.000000
Kyrie Irving BOS PG Y 2017-2018 9.000000
Kemba Walker CHO PG N 2018-2019 9.000000
Kawhi Leonard TOR SF Y 2018-2019 9.000000
SAS SF Y 2016-2017 9.000000
Anthony Davis LAL PF Y 2019-2020 9.000000
Blake Griffin LAC PF Y 2015-2016 9.000000
Damian Lillard POR PG Y 2019-2020 9.000000
Anthony Davis NOP C N 2015-2016 9.000000

Explanation of Heat Map Analysis: Top 50 NBA Player Statistics based on Field Goals Per Game

Among the top 50 entries, the highest average is 11 field goals per game, recorded by James Harden (HOU, 2018-2019), Russell Westbrook (HOU, 2019-2020), and Giannis Antetokounmpo (MIL, 2019-2020). Harden and Westbrook did so for the same team in different seasons, while Antetokounmpo played for a different team, and all three were on teams that made the playoffs. All three also appear several times in the table, which indicates consistent performance over the years.

Looking at the broader picture, players such as LeBron James, Kyrie Irving, and Kevin Durant appear more than twice in the table, either because they played for different teams or because they posted a high field goal average at more than one position or in more than one season. The main point, however, is that only 10 of the 50 entries (20%) belong to players whose teams did not make the playoffs.

Since the field goals column covers both 2-pointers and 3-pointers, this result refines our earlier conclusions. Our hunch that a high average in 3-pointers or in 2-pointers alone makes a player more likely to be on a team that makes the NBA playoffs appears to be incorrect. The field goals analysis instead suggests that players who score well across both shooting categories are the ones most often found on playoff teams. To test this further, we will narrow the data down to the top 10 entries by average field goals per game.
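The statement that field goals cover both shot types can be checked directly: made field goals should equal made 2-pointers plus made 3-pointers. The sketch below flags rows where that does not hold. The rows are hypothetical, and the tolerance of 1 is an assumption chosen because the per-game values shown in the tables above are whole numbers, so small rounding gaps are expected.

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook.
toy_df = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "2PointerPerGame": [6, 10, 4],
        "3PointerPerGame": [5, 1, 3],
        "FieldGoalsPerGame": [11, 11, 9],
    }
)

# Field goals implied by the two component columns.
implied_fg = toy_df["2PointerPerGame"] + toy_df["3PointerPerGame"]

# Absolute gap between the reported and implied values.
gap = (toy_df["FieldGoalsPerGame"] - implied_fg).abs()

# Rows whose gap exceeds the assumed rounding tolerance.
ROUNDING_TOLERANCE = 1
mismatches = toy_df[gap > ROUNDING_TOLERANCE]
print(mismatches)
```

An empty `mismatches` frame on the real data would support reading FieldGoalsPerGame as the sum of the two shooting categories.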

Top 10 NBA Player Statistics based on Average Field Goals Per Game

In [55]:
players_avg_fieldGoalsPerGame.sort_values(by=['FieldGoalsPerGame_Mean'], ascending = False)[:10]
Out[55]:
FieldGoalsPerGame_Mean
Player Team Position Playoff Year
James Harden HOU PG Y 2018-2019 11.0
Giannis Antetokounmpo MIL PF Y 2019-2020 11.0
Russell Westbrook HOU PG Y 2019-2020 11.0
LeBron James CLE SF Y 2016-2017 10.0
Anthony Davis NOP C N 2016-2017 10.0
Kevin Durant OKC SF Y 2015-2016 10.0
Anthony Davis NOP PF Y 2017-2018 10.0
LeBron James CLE SF Y 2015-2016 10.0
Giannis Antetokounmpo MIL PF Y 2018-2019 10.0
2017-2018 10.0

Explanation of Heat Map Analysis: Top 10 NBA Player Statistics based on Average Field Goals Per Game

Now that we have narrowed the data down to the top 10 entries by average field goals per game, we can see that some players have performed consistently well in this category. Giannis Antetokounmpo appears three times (2017-2018, 2018-2019, and 2019-2020), each time with the same team making the playoffs. Similarly, James Harden and Russell Westbrook each averaged 11 field goals per game for the same team, Harden in 2018-2019 and Westbrook in 2019-2020.

On the other hand, Anthony Davis averaged 10 field goals per game in both 2016-2017 and 2017-2018 with the same team but made the playoffs only once. In addition, only 1 of the 10 entries (10%) belongs to a player whose team did not make the playoffs. This suggests that individual statistics in the field goals per game category are associated with being on a team that makes the playoffs, more clearly than the 2-pointer or 3-pointer categories on their own.
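Many entries share the same whole-number average, so a fixed top 10 slice cuts through a group of tied rows, and which tied rows make the cut depends on sort order. One option, sketched below on hypothetical values, is `nlargest` with `keep="all"`, which keeps every row tied with the last selected value. This is a suggestion for handling ties, not what the cells above do.

```python
import pandas as pd

# Hypothetical averages with a three-way tie at 10.
toy_top = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D", "E"],
        "FieldGoalsPerGame_Mean": [11, 10, 10, 10, 9],
    }
)

# Strict top 2: one of the three tied rows is kept, the others are dropped.
strict_top2 = toy_top.nlargest(2, "FieldGoalsPerGame_Mean")

# keep="all" returns every row tied with the cutoff value, so more than 2 rows.
top2_with_ties = toy_top.nlargest(2, "FieldGoalsPerGame_Mean", keep="all")

print(len(strict_top2), len(top2_with_ties))
```

The percentage of non-playoff entries can then be computed over all tied rows, which avoids a result that changes with the order of ties.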

Heat Map Analysis: Individual NBA Player Statistics based on Average Total Points Per Game

Next, we analyze the TotalPointsPerGame category to see whether the data reveals patterns similar to those in the field goals category. We will use TotalPointsPerGame to find the top 50 entries with the highest average in the total points per game category. The following heat map analysis will help show whether our hunch is correct or incorrect here, namely that NBA players who average a high number of total points per game are more likely to be on teams that make the NBA playoffs.

Let's look at the top 50 individual NBA player statistics, grouped by player and by four categories (team, playoff status, position, and year), and sorted by average total points per game.

In [56]:
# look at heat map analysis on overall mean Total Points Per Game for five-year historical data
nba_totalpoints =df.groupby(['Player', 'Team','Playoff','Position','Year']).agg({'TotalPointsPerGame':['mean']})
nba_totalpoints.columns = ['TotalPointsPerGame_Mean']

players_avg_totalpointsPerGame = nba_totalpoints.sort_values(by=['TotalPointsPerGame_Mean'],ascending = False)[:50]
players_avg_totalpointsPerGame.style.background_gradient(cmap = 'CMRmap')
Out[56]:
TotalPointsPerGame_Mean
Player Team Playoff Position Year
James Harden HOU Y PG 2018-2019 36.000000
SG 2019-2020 34.000000
Russell Westbrook OKC Y PG 2016-2017 32.000000
Bradley Beal WAS N SG 2019-2020 31.000000
Trae Young ATL N PG 2019-2020 30.000000
Stephen Curry GSW Y PG 2015-2016 30.000000
Damian Lillard POR Y PG 2019-2020 30.000000
James Harden HOU Y SG 2017-2018 30.000000
Isaiah Thomas BOS Y PG 2016-2017 29.000000
Luka Dončić DAL Y PG 2019-2020 29.000000
James Harden HOU Y SG 2015-2016 29.000000
Giannis Antetokounmpo MIL Y PF 2019-2020 29.000000
James Harden HOU Y PG 2016-2017 29.000000
Paul George OKC Y SF 2018-2019 28.000000
Kevin Durant OKC Y SF 2015-2016 28.000000
Joel Embiid PHI Y C 2018-2019 28.000000
Anthony Davis NOP N C 2016-2017 28.000000
DeMarcus Cousins SAC N C 2016-2017 28.000000
Anthony Davis NOP Y PF 2017-2018 28.000000
Giannis Antetokounmpo MIL Y PF 2018-2019 28.000000
2017-2018 27.000000
DeMarcus Cousins TOT N C 2016-2017 27.000000
Kawhi Leonard LAC Y SF 2019-2020 27.000000
LeBron James CLE Y PF 2017-2018 27.000000
Stephen Curry GSW Y PG 2018-2019 27.000000
DeMarcus Cousins SAC N C 2015-2016 27.000000
Damian Lillard POR Y PG 2016-2017 27.000000
2017-2018 27.000000
DeMar DeRozan TOR Y SG 2016-2017 27.000000
Kawhi Leonard TOR Y SF 2018-2019 27.000000
Kyrie Irving BRK Y PG 2019-2020 27.000000
LeBron James LAL N SF 2018-2019 27.000000
Devin Booker PHO N SG 2018-2019 27.000000
Russell Westbrook HOU Y PG 2019-2020 27.000000
Devin Booker PHO N SG 2019-2020 27.000000
LeBron James CLE Y SF 2016-2017 26.000000
Kevin Durant GSW Y SF 2017-2018 26.000000
Kemba Walker CHO N PG 2018-2019 26.000000
Stephen Curry GSW Y PG 2017-2018 26.000000
Damian Lillard POR Y PG 2018-2019 26.000000
Anthony Davis NOP N C 2018-2019 26.000000
Bradley Beal WAS N SG 2018-2019 26.000000
Anthony Davis LAL Y PF 2019-2020 26.000000
Karl-Anthony Towns MIN N C 2019-2020 26.000000
Zach LaVine CHI N SG 2019-2020 26.000000
Kawhi Leonard SAS Y SF 2016-2017 26.000000
Kevin Durant GSW Y SF 2018-2019 26.000000
LeBron James CLE Y SF 2015-2016 25.000000
Karl-Anthony Towns MIN N C 2016-2017 25.000000
Devin Booker PHO N SG 2017-2018 25.000000

Explanation of Heat Map Analysis: Top 50 NBA Player Statistics based on Total Points Per Game

Among the top 50 entries, the highest average is 36 total points per game, recorded by James Harden in 2018-2019, while Russell Westbrook averaged 32 total points per game in 2016-2017. They played for different teams that both made the playoffs, and both appear repeatedly in the table, which indicates consistent performance over the years. At the same time, Bradley Beal and Trae Young were among the top four scorers of the 2019-2020 season in this table, but their teams did not make the playoffs. This shows that strong performance in the total points per game category does not guarantee a playoff spot.

Looking at the broader picture, players such as LeBron James, Anthony Davis, Bradley Beal, and Kevin Durant appear more than once in the table, either because they played for different teams or because they posted a high scoring average at more than one position or in more than one season. The main point, however, is that 16 of the 50 entries (32%) belong to players whose teams did not make the playoffs. This supports the refined hunch from the field goals analysis: players who score well overall, across both shooting categories, are more often on teams that make the playoffs. To test this further, we will narrow the data down to the top 10 entries by average total points per game.
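Statements about players who appear several times in a top-N table can be verified by counting values in the `Player` index level. The sketch below uses a hypothetical table with invented players `A` and `B`; the same call would apply to `players_avg_totalpointsPerGame`.

```python
import pandas as pd

# Hypothetical top-N table with Player as an index level, as after the groupby.
idx = pd.MultiIndex.from_tuples(
    [
        ("A", "X", "2018-2019"),
        ("A", "X", "2019-2020"),
        ("B", "Y", "2019-2020"),
        ("A", "Z", "2017-2018"),
    ],
    names=["Player", "Team", "Year"],
)
toy_top = pd.DataFrame({"TotalPointsPerGame_Mean": [36, 34, 31, 30]}, index=idx)

# Count how many rows each player contributes to the table.
appearances = toy_top.index.get_level_values("Player").value_counts()

# Players who appear more than once.
repeat_players = appearances[appearances > 1]
print(repeat_players)
```

Changing the threshold to `> 2` would list only the players who appear more than twice.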

Top 10 NBA Player Statistics based on Average Total Points Per Game

In [57]:
players_avg_totalpointsPerGame.sort_values(by=['TotalPointsPerGame_Mean'], ascending = False)[:10]
Out[57]:
TotalPointsPerGame_Mean
Player Team Playoff Position Year
James Harden HOU Y PG 2018-2019 36.0
SG 2019-2020 34.0
Russell Westbrook OKC Y PG 2016-2017 32.0
Bradley Beal WAS N SG 2019-2020 31.0
Trae Young ATL N PG 2019-2020 30.0
Stephen Curry GSW Y PG 2015-2016 30.0
Damian Lillard POR Y PG 2019-2020 30.0
James Harden HOU Y SG 2017-2018 30.0
Isaiah Thomas BOS Y PG 2016-2017 29.0
Luka Dončić DAL Y PG 2019-2020 29.0

Explanation of Heat Map Analysis: Top 10 NBA Player Statistics based on Average Total Points Per Game

Now that we have narrowed the data down to the top 10 entries by average total points per game, we can see that some players have performed consistently well in this category. James Harden leads in both 2018-2019 (36 points per game) and 2019-2020 (34 points per game), each time with the same team making the playoffs. Similarly, Russell Westbrook averaged 32 total points per game in 2016-2017 for a different team that also made the playoffs.

On the other hand, Bradley Beal averaged 31 total points per game and Trae Young averaged 30 total points per game in the same 2019-2020 regular season on different teams, and neither team made the playoffs. In addition, 2 of the 10 entries (20%) belong to players whose teams did not make the playoffs. This suggests that individual statistics in the total points per game category are associated with being on a team that makes the playoffs, but the outcome also depends on overall team statistics.
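The top-N tables only show the highest scorers. To describe how a stat relates to playoff status across all rows, one option is to compare group means and compute a correlation against a 0/1 playoff indicator. The sketch below does this on hypothetical values. It is an illustration of the idea, not a result from the project data.

```python
import pandas as pd

# Hypothetical rows; in the notebook this would be the full df.
toy_df = pd.DataFrame(
    {
        "Playoff": ["Y", "Y", "Y", "N", "N", "N"],
        "TotalPointsPerGame": [30, 25, 20, 28, 12, 10],
    }
)

# Mean scoring average for playoff and non-playoff rows.
group_means = toy_df.groupby("Playoff")["TotalPointsPerGame"].mean()

# Encode playoff status as 1/0 and compute the Pearson correlation with the stat.
made_playoffs = (toy_df["Playoff"] == "Y").astype(int)
r = made_playoffs.corr(toy_df["TotalPointsPerGame"])

print(group_means)
print(f"correlation with playoff indicator: {r:.2f}")
```

A correlation near zero would mean the stat says little about playoff status across the whole league, even if the top of the table is dominated by playoff teams.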
